EMPIRICAL MARGIN DISTRIBUTIONS AND BOUNDING THE GENERALIZATION ERROR OF COMBINED CLASSIFIERS



We prove new probabilistic upper bounds on generalization error of complex classifiers that are 
combinations of simple classifiers. Such combinations could be implemented by neural networks or 
by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in 
terms of the empirical distribution of the margin of the combined classifier. They are based on the 
methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization 
method, concentration inequalities) and they improve previous results of Bartlett (1998) on bounding
the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons and of Schapire,
Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of 
convergence in Levy distance of empirical margin distribution to the true margin distribution uniformly 
over the classes of classifiers and prove the optimality of these rates. 
1. Introduction. Let $(X, Y)$ be a random couple, where $X$ is an instance in a space $S$ and $Y \in \{-1, 1\}$
is a label. Let $\mathcal{G}$ be a set of functions from $S$ into $\mathbb{R}$. For $g \in \mathcal{G}$, $\mathrm{sign}(g(X))$ will be used as a predictor (a
classifier) of the unknown label $Y$. If the distribution of $(X, Y)$ is unknown, then the choice of the predictor
is based on the training data $(X_1, Y_1), \dots, (X_n, Y_n)$ that consists of $n$ i.i.d. copies of $(X, Y)$. The goal of
learning is to find a predictor $g \in \mathcal{G}$ (based on the training data) whose generalization (classification) error
$P\{Yg(X) \le 0\}$ is small enough. In this paper, our main concern is to find reasonably good probabilistic
upper bounds on the generalization error. The standard approach to this problem was developed in seminal
papers of Vapnik and Chervonenkis in the 70s and 80s (see Vapnik (1998), Devroye, Gyorfi and Lugosi (1996),
Vidyasagar (1997)) and it is based on bounding the difference between the generalization error $P\{Yg(X) \le 0\}$
and the training error

$$n^{-1}\sum_{j=1}^{n} I_{\{Y_j g(X_j) \le 0\}}$$

uniformly over the whole class $\mathcal{G}$ of classifiers $g$. These bounds are expressed in terms of data dependent
entropy characteristics of the class of sets $\{\{(x,y) : yg(x) \le 0\} : g \in \mathcal{G}\}$ or, frequently, in terms of the so
called VC-dimension of the class. It turned out, however, that in many important examples (for instance, in
neural network learning) the VC-dimension of the class can be very large, or even infinite, which makes
direct application of Vapnik-Chervonenkis type bounds impossible. Recently, several authors (see
Bartlett (1998), Schapire, Freund, Bartlett and Lee (1998), Anthony and Bartlett (1999)) suggested another 
class of upper bounds on generalization error that are expressed in terms of the empirical distribution of the 
margin of the predictor (the classifier). The margin is defined as the product Yg(X). The bounds in question 
are especially useful in the case of the classifiers that are the combinations of simpler classifiers (that belong,
say, to a class $\mathcal{H}$). One of the examples of such classifiers is provided by neural networks. Other examples are
given by the classifiers obtained by boosting, bagging and other voting methods of combining the classifiers.
The bounds in terms of margins are also of interest in application to generalization performance of support
vector machines, Cortes and Vapnik (1995), Vapnik (1998), Bartlett and Shawe-Taylor (1999). The upper
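bounds have the following form (up to some extra terms)

$$\inf_{\delta > 0}\Big[\, n^{-1}\sum_{j=1}^{n} I_{\{Y_j g(X_j) \le \delta\}} + \frac{C(\mathcal{G})\,\phi(\delta)}{\sqrt{n}}\, C(\mathcal{H}) \Big],$$

where $C(\mathcal{G})$ is a constant depending on the class $\mathcal{G}$ (in other words, on the method of combining the simple
classifiers), $\phi$ is a decreasing function such that $\phi(\delta) \to \infty$ as $\delta \to 0$ (often, for instance, $\phi(\delta) = 1/\delta$),
$C(\mathcal{H})$ is a constant depending on the class $\mathcal{H}$ (in particular, on the VC-dimension, or some type of entropy
characteristics of the class).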
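
To make the first term of such bounds concrete, here is a minimal sketch (an illustration only, not taken from the paper) that computes the margins $Y_j g(X_j)$ of a convex combination of base classifiers on a toy sample and evaluates the empirical margin distribution at a given $\delta$. The data, the base predictions and the weights are hypothetical.

```python
import numpy as np

def margins(weights, base_preds, y):
    """Margins y_j * g(x_j) for g = sum_k weights[k] * h_k, with h_k taking values in {-1, +1}."""
    g = base_preds @ weights          # combined classifier on each example, shape (n,)
    return y * g

def empirical_margin_cdf(m, delta):
    """Fraction of training examples whose margin is <= delta."""
    return float(np.mean(m <= delta))

# Hypothetical toy data: 6 examples, 3 base classifiers (columns), labels y.
y = np.array([1, 1, 1, -1, -1, -1])
base_preds = np.array([
    [ 1,  1,  1],
    [ 1,  1, -1],
    [ 1, -1,  1],
    [-1, -1, -1],
    [-1, -1,  1],
    [ 1, -1, -1],
])
weights = np.array([0.5, 0.25, 0.25])   # convex weights, summing to 1

m = margins(weights, base_preds, y)
training_error = empirical_margin_cdf(m, 0.0)   # value at delta = 0
```

Evaluating `empirical_margin_cdf(m, delta)` for increasing `delta` shows how fast the first term grows, which is what the trade-off with the second term depends on.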

It was observed in experiments that classifiers produced by such methods as boosting tend to have
rather large margins of correctly classified examples. This allows one to choose a relatively large value of $\delta$
in the above bound without increasing substantially the value of the empirical distribution function of the
margin (which is the first term of the bound) compared with the training error. For large enough $\delta$, the
second term becomes small, which ensures a reasonably small value of the infimum. This allowed the above
mentioned authors to explain partially (at least at qualitative level) a very good generalization performance 
of voting and some other methods of combining simple classifiers observed in many experiments. This also 
motivated the development of the methods of combining the classifiers based on explicit optimization of 
the penalized average cost function of the margins, see Mason, Bartlett and Baxter (1999), Mason, Baxter, 
Bartlett and Frean (1999). 

Despite the fact that previously developed bounds provide some explanations of the generalization
performance of complex classifiers, it was actually acknowledged by Bartlett (1998), Schapire, Freund, Bartlett
and Lee (1998) that the bounds in question have not reached their final form yet and more research is
needed to understand better the probabilistic nature of these bounds. This becomes especially important
because of the growing number of boosting type methods (see Friedman, Hastie, Tibshirani (2000),
Friedman (1999)) for which a comprehensive theory is yet to be developed. The methods of proof developed by
Bartlett (1998) are based on the so called fat-shattering dimensions of function classes and on the extension
of Vapnik-Chervonenkis type inequalities to such dimensions. The method of Schapire, Freund, Bartlett and
Lee (1998) exploits the fact that the complex classifiers are convex combinations of base classifiers (these
authors suggest also an extension of their method to the classes of functions for which there exist so called
$\varepsilon$-sloppy $\theta$-coverings). The use of these methods in the case of general cost functions of the margins poses
some difficulties (see Mason, Bartlett and Baxter (1999)).

In this paper, we develop a new approach that allows us to improve and better understand some of the 
previously known bounds. Our method is based on the general results of the theory of Gaussian, Rademacher
and empirical processes (such as comparison inequalities, e.g. Slepian's Lemma, symmetrization and random 
multipliers inequalities, concentration inequalities, see Ledoux and Talagrand (1991), van der Vaart and 
Wellner (1996), Dudley (1999)). We give the bounds in terms of general functions of the margins, satisfying 
a Lipschitz condition. They can be readily applied to the classifiers based on explicit optimization of margin 
cost functions (such as in the paper of Mason, Bartlett and Baxter (1999)). In the case of Bartlett's bounds 
for feedforward neural networks in terms of the $\ell_1$-norms of the weights of the neurons (see Bartlett (1998)
and also Fine (1999)), the improvement we got is substantial. In Bartlett's bounds the constant $C(\mathcal{G})$ is
of the order $(AL)^{l(l+1)/2}$, where $A$ is an upper bound on the $\ell_1$-norms of the weights of neurons, $L$ is the
Lipschitz constant of the sigmoids, and $l$ is the number of layers of the network. Also, in his bound $\phi(\delta) = 1/\delta^{l}$.
We obtained in a similar context $C(\mathcal{G})$ of the order $(AL)^{l}$ with $\phi(\delta) = 1/\delta$.

Based on our bounds, we developed a method of complexity penalization of the training error of neural 
network learning with penalties defined as functionals of the weights of neurons and prove oracle inequalities 
showing some form of optimality of this method. 

We also obtained general rates of convergence of the empirical margin distributions to the theoretical 
one in the Lévy distance. Namely, we proved that the empirical margin distribution converges to the true
margin distribution with probability 1 uniformly over the class $\mathcal{G}$ of classifiers if and only if the class $\mathcal{G}$
is Glivenko-Cantelli. Moreover, if $\mathcal{G}$ is a Donsker class, then the rate of convergence in Lévy distance is
$O(n^{-1/4})$. Faster rates (up to $O(n^{-1/2})$) are possible under some assumptions on random entropies of the
class $\mathcal{G}$. We give some examples, showing the optimality of these rates.

We improved previously known bounds on generalization error of convex combinations of classifiers.
In particular, our results in Section 3 imply that if the random $\varepsilon$-entropy of the class $\mathcal{G}$ grows as $\varepsilon^{-\alpha}$ for
$\alpha \in (0,2)$, then the generalization error of any classifier from $\mathcal{G}$ with zero training error is bounded from
above with very high probability by the quantity

$$\frac{C}{n^{2/(2+\alpha)}\,\delta^{2\alpha/(2+\alpha)}},$$

where $\delta$ is the minimal classification margin of the training examples and $C$ is a constant. The previously
known result of Schapire, Freund, Bartlett and Lee (1998) gives (up to logarithmic factors, for $\mathcal{G} = \mathrm{conv}(\mathcal{H})$,
$\mathcal{H}$ being a VC-class) the bound $O\big(\frac{1}{\delta\sqrt{n}}\big)$, which corresponds to the worst choice of $\alpha$ ($\alpha = 2$). We introduce
in Section 3 more subtle notions of $\gamma$-margin $\delta_n(\gamma; g)$ and empirical $\gamma$-margin $\hat\delta_n(\gamma; g)$ (parametrized by
$\gamma \in (0, 1]$) of a classifier $g$. These quantities allow us to obtain similar upper bounds on generalization error
of the form

$$\frac{C}{n^{1-\gamma/2}\,\hat\delta_n(\gamma; g)^{\gamma}}$$

in the case when the training error of the classifier $g$ is not necessarily equal to 0. We call the quantity

$$\frac{1}{n^{1-\gamma/2}\,\hat\delta_n(\gamma; g)^{\gamma}}$$

the $\gamma$-bound of $g$. It follows from the definitions given in Section 3 that the $\gamma$-bounds decrease when $\gamma$
decreases from 1 to 0. We prove that for any $\gamma \ge \frac{2\alpha}{2+\alpha}$ with very high probability the $\gamma$-bounds are indeed
upper bounds on the generalization error (up to a multiplicative constant $C_\gamma$).

The proof of the bounds of this type is based on the powerful concentration inequalities of Talagrand
(1996a,b). For small $\alpha$, the bound may become arbitrarily close to the rate $O(n^{-1})$, which is known to be the
best possible convergence rate in the zero error case. In the case of convex combinations of classifiers from a
VC-class $\mathcal{H}$, one can choose $\alpha = 2(V - 1)/V$, where $V$ is the VC-dimension of the class $\mathcal{H}$, which improves
the previously known bounds for convex combinations of classifiers. We believe that these results can be of
importance in some other learning problems (such as support vector learning, see Vapnik (1998)).

Fig. 1. Comparison of the generalization error (dashed line) with the $\gamma$-bounds for $\gamma = 1$, $0.8$ and $2/3$ (solid lines, top to
bottom)

Koltchinskii, Panchenko and Lozano (2000a, b) studied the behavior of the $\gamma$-bounds and some other
bounds of similar type in a number of experiments with AdaBoost and other methods of combining classifiers. 
We have run AdaBoost for a number of rounds with a weak learner that output simple classifiers (e.g. decision 
stumps) from a small VC-class. In some of the experiments, we dealt with a toy learning problem ("intervals 
problem" ) for which it was easy to compute the generalization error precisely. In other cases, we dealt with 
real data from UCI Irvine repository (see Blake and Merz (1998)) and we estimated the generalization error 
based on test samples. In both cases, we computed the $\gamma$-margins and the corresponding $\gamma$-bounds based on
the training data and compared the bounds with the generalization error (or with the test error). We give 
here only a short summary of the results of these experiments (and some related theoretical results). The 
details are given in Koltchinskii, Panchenko and Lozano (2000a, b). 

• One of the goals of the experiments was to determine the value of the constant $C_\gamma$ involved in the
$\gamma$-margin bounds on generalization error. The results of Section 3 of this paper show that such a constant
exists. Its size, however, is related to a hard problem of optimizing the constants involved in Talagrand's
concentration inequality for empirical processes that was used in the proofs. Our experiments showed that
the choice $C_\gamma = 1$ worked rather well in the bounds of this type. They also showed that the $\gamma$-bounds did
improve the previously known bounds on generalization error of AdaBoost. The improvement was significant
when the VC-dimension of the base class was small and, hence, the parameter $\gamma$ could be chosen much
smaller than 1. Figure 1 shows a typical result of the experiments.

• We also observed that the ratios $\hat\delta_n(\gamma; g)/\delta_n(\gamma; g)$ of the empirical $\gamma$-margins to the true $\gamma$-margins of classifiers
$g$ produced by AdaBoost were surprisingly close to 1 (at least for large sample sizes). The results of
Section 3 imply that, with high probability, these ratios are bounded away from 0 and from $\infty$ uniformly
in $g \in \mathcal{G}$ for any $\gamma \ge \frac{2\alpha}{2+\alpha}$. Recently, the first author proved that the ratios do converge to 1 uniformly in
$g \in \mathcal{G}$ a.s. as $n \to \infty$ for $\gamma > \frac{2\alpha}{2+\alpha}$ (an example was also given showing that for $\gamma = \frac{2\alpha}{2+\alpha}$ the ratios do not
necessarily converge to 1 and for $\gamma < \frac{2\alpha}{2+\alpha}$ they can tend to $\infty$). The closeness of the ratios to 1 explains why
the $\gamma$-bounds are valid with $C_\gamma = 1$.

• In the case of the classifiers obtained in consecutive rounds of AdaBoost, the $\gamma$-bounds hold even for
the values of $\gamma$ that are substantially smaller than the threshold given by the theory. It might be related
to the fact that the threshold is based on the bounds on the entropy of the whole convex hull of the base class
$\mathcal{H}$. On the other hand, AdaBoost and other algorithms of this type output classifiers that belong to a subset
$\mathcal{G} \subset \mathrm{conv}(\mathcal{H})$ whose entropy might be much smaller than the entropy of the whole convex hull. Because of
this, it is important to develop adaptive versions of the margin type bounds on generalization error that
take into account the complexity of the classifiers output by learning algorithms as well as their empirical
margins. A possible approach to this problem was developed in Koltchinskii, Panchenko and Lozano (2000a).

It should be mentioned that this paper describes only one of a number of growing areas of
applications of Probability to computer learning problems. Some other important examples of such applications
are given in Yukich, Stinchcombe and White (1995), Barron (1991a, b), Barron, Birge and Massart (1999), 
Talagrand (1998), Freund (1995, 1999). 

2. Probabilistic bounds for general function classes in terms of Gaussian and Rademacher
complexities. Let $(S, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from
$(S, \mathcal{A})$ into $\mathbb{R}$. [Later, in Sections 5, 6 we will replace $S$ by $S \times \{-1, 1\}$, considering labeled observations; at
this point, it is not important.] Let $\{X_k\}$ be a sequence of i.i.d. random variables taking values in $(S, \mathcal{A})$
with common distribution $P$. We assume that this sequence is defined on a probability space $(\Omega, \Sigma, \mathbb{P})$. Let
$P_n$ be the empirical measure based on the sample $(X_1, \dots, X_n)$,

$$P_n := n^{-1}\sum_{i=1}^{n} \delta_{X_i},$$

where $\delta_x$ denotes the probability distribution concentrated at the point $x$. We will denote $Pf := \int_S f\,dP$,
$P_n f := \int_S f\,dP_n$, etc.

In what follows, $\ell^{\infty}(\mathcal{F})$ denotes the Banach space of uniformly bounded real valued functions on $\mathcal{F}$
with the norm

$$\|Y\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |Y(f)|.$$

We assume throughout the paper that $\mathcal{F}$ satisfies standard measurability assumptions of the theory of
empirical processes (see Dudley (1999), van der Vaart and Wellner (1996)) (for simplicity, one can assume
that $\mathcal{F}$ is countable, but this, of course, is not necessary).

Our goal in this section is to construct data dependent upper bounds on the probability $P\{f \le 0\}$ and
on the difference $|P_n\{f \le 0\} - P\{f \le 0\}|$ that hold for all $f \in \mathcal{F}$ with high probability. These inequalities
will be used in the next sections to upper bound the generalization error of combined classifiers. The bounds
will depend on some measures of "complexity" of the class $\mathcal{F}$ which will be introduced next.

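Define

$$G_n(\mathcal{F}) := \mathbb{E}\Big\| n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i} \Big\|_{\mathcal{F}},$$

where $\{g_i\}$ is a sequence of i.i.d. standard normal random variables, independent of $\{X_i\}$. [Actually, it is
common to assume that $\{g_i\}$ is defined on a separate probability space $(\Omega_g, \Sigma_g, \mathbb{P}_g)$ and that the basic
probability space is now $(\Omega \times \Omega_g, \Sigma \times \Sigma_g, \mathbb{P} \times \mathbb{P}_g)$.] We will call $n \mapsto G_n(\mathcal{F})$ the Gaussian complexity function
of the class $\mathcal{F}$.

Similarly, we define

$$R_n(\mathcal{F}) := \mathbb{E}\Big\| n^{-1}\sum_{i=1}^{n} \varepsilon_i \delta_{X_i} \Big\|_{\mathcal{F}},$$

where $\{\varepsilon_i\}$ is a sequence of i.i.d. Rademacher (taking values $+1$ and $-1$ with probability $1/2$ each) random
variables, independent of $\{X_i\}$. We will call $n \mapsto R_n(\mathcal{F})$ the Rademacher complexity function of the class $\mathcal{F}$.

One can find in the literature (see, e.g., van der Vaart and Wellner (1996)) various upper bounds on
such quantities as $G_n(\mathcal{F})$ and $R_n(\mathcal{F})$ in terms of entropies, VC-dimensions, etc.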
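
As an illustration of these definitions, the sketch below estimates the two complexities by Monte Carlo for a finite class evaluated on a fixed sample. It is only a sketch under stated assumptions: the function name `complexity_mc` and its parameters are hypothetical, and the expectation is taken over the multipliers only (conditionally on the sample), whereas $R_n(\mathcal{F})$ and $G_n(\mathcal{F})$ above also average over the sample.

```python
import numpy as np

def complexity_mc(values, multipliers="rademacher", n_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_f |n^{-1} sum_i xi_i f(X_i)| for a finite class.

    values: array of shape (num_functions, n) holding f(X_i) on a fixed sample.
    The expectation is over the multipliers xi_i only (conditional on the sample).
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    if multipliers == "rademacher":
        xi = rng.choice([-1.0, 1.0], size=(n_draws, n))
    elif multipliers == "gaussian":
        xi = rng.standard_normal(size=(n_draws, n))
    else:
        raise ValueError("multipliers must be 'rademacher' or 'gaussian'")
    # One supremum over the class per draw of the multipliers.
    sups = np.abs(xi @ values.T / n).max(axis=1)
    return float(sups.mean())
```

For a class consisting of the single function $f \equiv 1$ and $n = 1$, the Rademacher estimate is exactly 1, while the Gaussian estimate approximates $\mathbb{E}|g| = \sqrt{2/\pi}$.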
First, we give bounds on $P\{f \le 0\}$ in terms of a class of so called margin cost functions. These bounds
will be used in Section 5 in the context of classification problems to improve recent results of Mason, Bartlett
and Baxter (1999).

Consider a countable family of Lipschitz functions $\Phi = \{\varphi_k : k \ge 1\}$, where $\varphi_k : \mathbb{R} \to \mathbb{R}$ are
such that $I_{(-\infty, 0]}(x) \le \varphi_k(x)$ for all $x \in \mathbb{R}$ and all $k$. For each $\varphi \in \Phi$, $L(\varphi)$ will denote its Lipschitz constant.

We assume that for any $x \in S$ the set of real numbers $\{f(x) : f \in \mathcal{F}\}$ is bounded.



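Theorem 1. For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{k \ge 1}\Big[P_n\varphi_k(f) + 4L(\varphi_k)R_n(\mathcal{F}) + \Big(\frac{\log k}{n}\Big)^{1/2}\Big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{k \ge 1}\Big[P_n\varphi_k(f) + \sqrt{2\pi}\,L(\varphi_k)G_n(\mathcal{F}) + \Big(\frac{\log k}{n}\Big)^{1/2}\Big] + \frac{t + 2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$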



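Proof. Without loss of generality we can and do assume that each $\varphi \in \Phi$ takes its values in $[0, 1]$
(otherwise it can be redefined as $\varphi \wedge 1$). Clearly, in this case $\varphi(x) = 1$ for $x \le 0$. For a fixed $\varphi \in \Phi$ and for
all $f \in \mathcal{F}$ we have

$$P\{f \le 0\} \le P\varphi(f) \le P_n\varphi(f) + \|P_n - P\|_{\mathcal{G}_\varphi}, \tag{2.1}$$

where

$$\mathcal{G}_\varphi := \{\varphi \circ f - 1 : f \in \mathcal{F}\}.$$

By the exponential inequalities for martingale difference sequences (see pp 135-136), we have

$$\mathbb{P}\Big\{\|P_n - P\|_{\mathcal{G}_\varphi} \ge \mathbb{E}\|P_n - P\|_{\mathcal{G}_\varphi} + \frac{t}{\sqrt{n}}\Big\} \le \exp\{-2t^2\}.$$

Thus, with probability at least $1 - \exp\{-2t^2\}$ for all $f \in \mathcal{F}$

$$P\{f \le 0\} \le P_n\varphi(f) + \mathbb{E}\|P_n - P\|_{\mathcal{G}_\varphi} + \frac{t}{\sqrt{n}}. \tag{2.2}$$

The Symmetrization Inequality gives [34]

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}_\varphi} \le 2\,\mathbb{E}\Big\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi}. \tag{2.3}$$

Since the function $(\varphi - 1)/L(\varphi)$ is a contraction and $\varphi(0) - 1 = 0$, the Rademacher comparison inequality
(Ledoux and Talagrand (1991), Theorem 4.12, p. 112) implies

$$\mathbb{E}_\varepsilon\Big\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi} \le 2L(\varphi)\,\mathbb{E}_\varepsilon\Big\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Big\|_{\mathcal{F}}.$$

It now follows from (2.2), (2.3) that with probability at least $1 - e^{-2t^2}$ we have for all $f \in \mathcal{F}$

$$P\{f \le 0\} \le P_n\varphi(f) + 4L(\varphi)R_n(\mathcal{F}) + \frac{t}{\sqrt{n}}. \tag{2.4}$$

We use now (2.4) with $\varphi = \varphi_k$ and $t$ replaced by $t + \sqrt{\log k}$ to obtain

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{k \ge 1}\Big[P_n\varphi_k(f) + 4L(\varphi_k)R_n(\mathcal{F}) + \Big(\frac{\log k}{n}\Big)^{1/2}\Big] + \frac{t}{\sqrt{n}}\Big\} \tag{2.5}$$

$$\le \sum_{k \ge 1}\exp\{-2(t + \sqrt{\log k})^2\} \le \sum_{k \ge 1} k^{-2}e^{-2t^2} \le 2e^{-2t^2}.$$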
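
The last chain of inequalities in (2.5) can be checked numerically. The sketch below (an illustration, not part of the proof) sums $\exp\{-2(t + \sqrt{\log k})^2\}$ over a long range of $k$ and compares it with $2e^{-2t^2}$; the truncation level `k_max` is an arbitrary choice.

```python
import math

def union_bound_sum(t, k_max=10000):
    """Partial sum of exp(-2 (t + sqrt(log k))^2) over k = 1, ..., k_max."""
    return sum(
        math.exp(-2.0 * (t + math.sqrt(math.log(k))) ** 2)
        for k in range(1, k_max + 1)
    )

def target(t):
    """Right-hand side 2 exp(-2 t^2) appearing in Theorem 1."""
    return 2.0 * math.exp(-2.0 * t * t)
```

Since $(t + \sqrt{\log k})^2 \ge t^2 + \log k$, each term is at most $k^{-2}e^{-2t^2}$, and $\sum_{k \ge 1} k^{-2} = \pi^2/6 < 2$, which is why the partial sums stay below `target(t)`.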
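The proof of the second bound is quite similar with the following changes. The class $\mathcal{G}_\varphi$ is defined in
this case as $\{\varphi \circ f : f \in \mathcal{F}\}$. Instead of (2.3), we have in this case, by the Symmetrization Inequality and
Gaussian Multiplier Inequality (see van der Vaart and Wellner (1996), pp. 108-109, 177-179), that

$$\mathbb{E}\|P_n - P\|_{\mathcal{G}_\varphi} \le 2\,\mathbb{E}\Big\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi} \le \sqrt{2\pi}\,\mathbb{E}\Big\|n^{-1}\sum_{i=1}^{n}g_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi}. \tag{2.6}$$

Define Gaussian processes

$$Z_1(f, \sigma) := \sigma n^{-1/2}\sum_{i=1}^{n} g_i(\varphi \circ f)(X_i)$$

and

$$Z_2(f, \sigma) := L(\varphi)n^{-1/2}\sum_{i=1}^{n} g_i f(X_i) + \sigma g,$$

where $\sigma = \pm 1$ and $g$ is standard normal independent of the sequence $\{g_i\}$. If we denote by $\mathbb{E}_g$ the expectation
on the probability space $(\Omega_g, \Sigma_g, \mathbb{P}_g)$ on which the sequence $\{g_i\}$ and $g$ are defined then we have

$$\mathbb{E}_g|Z_1(f, \sigma) - Z_1(h, \sigma')|^2 \le \mathbb{E}_g|Z_2(f, \sigma) - Z_2(h, \sigma')|^2, \tag{2.7}$$

which is easy to observe if we consider separately the cases when $\sigma\sigma'$ is equal to $1$ and to $-1$. Indeed, if
$\sigma\sigma' = 1$ then (2.7) is equivalent to

$$n^{-1}\sum_{i=1}^{n}|\varphi(f(X_i)) - \varphi(h(X_i))|^2 \le L(\varphi)^2 n^{-1}\sum_{i=1}^{n}|f(X_i) - h(X_i)|^2,$$

which holds since $\varphi$ satisfies the Lipschitz condition with constant $L(\varphi)$. If $\sigma\sigma' = -1$ then since $0 \le \varphi \le 1$
we have

$$\mathbb{E}_g|Z_1(f, \sigma) - Z_1(h, \sigma')|^2 \le 2n^{-1}\sum_{i=1}^{n}\varphi^2(f(X_i)) + 2n^{-1}\sum_{i=1}^{n}\varphi^2(h(X_i)) \le \mathbb{E}(2g)^2 \le \mathbb{E}_g|Z_2(f, \sigma) - Z_2(h, \sigma')|^2.$$

A version of Slepian's Lemma (see Ledoux and Talagrand (1991), pp. 76-77) implies that

$$\mathbb{E}_g\sup\{Z_1(f, \sigma) : f \in \mathcal{F},\ \sigma = \pm 1\} \le \mathbb{E}_g\sup\{Z_2(f, \sigma) : f \in \mathcal{F},\ \sigma = \pm 1\}.$$

We have

$$\mathbb{E}_g\Big\|n^{-1/2}\sum_{i=1}^{n}g_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi} = \mathbb{E}_g\sup_{h \in \tilde{\mathcal{G}}_\varphi}\Big[n^{-1/2}\sum_{i=1}^{n}g_i h(X_i)\Big] = \mathbb{E}_g\sup\{Z_1(f, \sigma) : f \in \mathcal{F},\ \sigma = \pm 1\},$$

where $\tilde{\mathcal{G}}_\varphi := \{\varphi(f), -\varphi(f) : f \in \mathcal{F}\}$, and similarly

$$L(\varphi)\,\mathbb{E}_g\Big\|n^{-1/2}\sum_{i=1}^{n}g_i\delta_{X_i}\Big\|_{\mathcal{F}} + \mathbb{E}|g| \ge \mathbb{E}_g\sup\{Z_2(f, \sigma) : f \in \mathcal{F},\ \sigma = \pm 1\}.$$

This immediately gives us

$$\mathbb{E}_g\Big\|n^{-1}\sum_{i=1}^{n}g_i\delta_{X_i}\Big\|_{\mathcal{G}_\varphi} \le L(\varphi)\,\mathbb{E}_g\Big\|n^{-1}\sum_{i=1}^{n}g_i\delta_{X_i}\Big\|_{\mathcal{F}} + n^{-1/2}\,\mathbb{E}|g|. \tag{2.8}$$

It follows from (2.2), (2.6) and (2.8) that with probability at least $1 - e^{-2t^2}$

$$P\{f \le 0\} \le P_n\varphi(f) + \sqrt{2\pi}\,L(\varphi)G_n(\mathcal{F}) + \frac{t + 2}{\sqrt{n}}. \tag{2.9}$$

The proof now can be completed the same way as in the case of the first bound.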
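
The constants in (2.6) and (2.9) can be traced with a short computation. The sketch below is only a sanity check; it assumes the standard constant $\sqrt{\pi/2}$ in the comparison between Rademacher and Gaussian averages, and the variable names are hypothetical.

```python
import math

# E|g| for a standard normal g.
E_ABS_G = math.sqrt(2.0 / math.pi)

# Factor 2 from symmetrization times sqrt(pi/2) from the Gaussian multiplier
# comparison gives the constant sqrt(2*pi) in (2.6).
SYM_TO_GAUSS = 2.0 * math.sqrt(math.pi / 2.0)

# The term n^{-1/2} E|g| in (2.8), multiplied by sqrt(2*pi), contributes
# 2 / sqrt(n); this is the extra "+ 2" in the numerator t + 2 of (2.9).
ADDITIVE = math.sqrt(2.0 * math.pi) * E_ABS_G
```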
Let us consider a special family of cost functions. Assume that $\varphi$ is a fixed nonincreasing function such
that $\varphi(x) \ge I_{(-\infty, 0]}(x)$ for $x \in \mathbb{R}$ and $\varphi$ satisfies the Lipschitz condition with constant $L(\varphi)$. Let

$$\Phi := \{\varphi(\cdot/\delta) : \delta \in (0, 1]\}.$$

One can easily observe that $L(\varphi(\cdot/\delta)) \le L(\varphi)\delta^{-1}$. For this family, Theorem 1 easily implies the following
statement, which, in turn, implies the result of Schapire, Freund, Bartlett and Lee (1998) for VC-classes of
base classifiers (see Section 5).


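Theorem 2. For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{\delta \in (0,1]}\Big[P_n\varphi\Big(\frac{f}{\delta}\Big) + \frac{8L(\varphi)}{\delta}R_n(\mathcal{F}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{\delta \in (0,1]}\Big[P_n\varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}G_n(\mathcal{F}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t + 2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$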



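Proof. One has to apply the bounds of Theorem 1 for the sequence $\varphi_k(\cdot) := \varphi(\cdot/\delta_k)$, where $\delta_k = 2^{-k}$,
and then notice that for $\delta \in (\delta_k, \delta_{k-1}]$, we have

$$\frac{1}{\delta_k} \le \frac{2}{\delta}, \qquad P_n\varphi\Big(\frac{f}{\delta_k}\Big) \le P_n\varphi\Big(\frac{f}{\delta}\Big)$$

and

$$\sqrt{\log k} = \sqrt{\log\log_2\frac{1}{\delta_k}} \le \sqrt{\log\log_2\frac{2}{\delta}}.$$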
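
The following sketch shows how the right-hand side of the first bound of Theorem 2 could be evaluated from data on the dyadic grid $\delta_k = 2^{-k}$ used in the proof. It is an illustration under assumptions: $\varphi$ is taken to be the ramp function (equal to 1 for $x \le 0$, 0 for $x \ge 1$, linear in between, so $L(\varphi) = 1$), the Rademacher complexity is supplied by the caller, and the names `ramp`, `theorem2_rhs` and `k_max` are hypothetical.

```python
import math
import numpy as np

def ramp(x):
    """phi(x) = 1 for x <= 0, 0 for x >= 1, linear in between (Lipschitz constant 1)."""
    return np.clip(1.0 - np.asarray(x, dtype=float), 0.0, 1.0)

def theorem2_rhs(f_values, rademacher, t, k_max=20, lipschitz=1.0):
    """Minimum over delta = 2^{-k}, k = 0..k_max, of
    P_n phi(f/delta) + 8 L(phi) R_n / delta + sqrt(log log_2(2/delta) / n),
    plus t / sqrt(n)."""
    f_values = np.asarray(f_values, dtype=float)
    n = f_values.size
    best = math.inf
    for k in range(k_max + 1):
        delta = 2.0 ** (-k)
        log_term = math.sqrt(math.log(math.log2(2.0 / delta)) / n)
        value = (ramp(f_values / delta).mean()
                 + 8.0 * lipschitz * rademacher / delta
                 + log_term)
        best = min(best, value)
    return best + t / math.sqrt(n)
```

Restricting the infimum to a grid can only increase its value, so the grid minimum remains an upper bound of the same type, at the price of being slightly looser.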



Remark. The constant 8 in front of the Rademacher complexity and the constant $2\sqrt{2\pi}$ in front of the
Gaussian complexity can be replaced by $4c$ and $\sqrt{2\pi}\,c$, respectively, for any $c > 1$ (with minor changes in the
logarithmic term). Also, one can choose $c = c(\delta)$, where $c(\delta) = 1 + o(1)$ as $\delta \to 0$.

In the next statements we use the Rademacher complexities, but Gaussian complexities can be used
similarly.

Assuming now that $\varphi$ is a function from $\mathbb{R}$ into $\mathbb{R}$ such that $\varphi(x) \le I_{(-\infty, 0]}(x)$ for all $x \in \mathbb{R}$ and $\varphi$ still
satisfies the Lipschitz condition with constant $L(\varphi)$, one can prove the following statement.



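Theorem 3. For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} < \sup_{\delta \in (0,1]}\Big[P_n\varphi\Big(\frac{f}{\delta}\Big) - \frac{8L(\varphi)}{\delta}R_n(\mathcal{F}) - \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] - \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

Denote

$$\Delta_n(\mathcal{F}; \delta) := \frac{8}{\delta}R_n(\mathcal{F}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}.$$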



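The bounds of Theorems 2 and 3 easily imply that for all $t > 0$

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > P_n\{f \le 0\} + \inf_{\delta \in (0,1]}\big[P_n\{0 < f \le \delta\} + \Delta_n(\mathcal{F}; \delta)\big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} < P_n\{f \le 0\} - \inf_{\delta \in (0,1]}\big[P_n\{-\delta < f \le 0\} + \Delta_n(\mathcal{F}; \delta)\big] - \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

To prove this it is enough to take $\varphi$ equal to 1 for $x \le 0$, 0 for $x \ge 1$ and linear in between in the case of
the first bound; in the case of the second bound, the choice of $\varphi$ is 1 for $x \le -1$, 0 for $x \ge 0$ and linear in
between. Similarly, it can be shown that

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P_n\{f \le 0\} > P\{f \le 0\} + \inf_{\delta \in (0,1]}\big[P\{0 < f \le \delta\} + \Delta_n(\mathcal{F}; \delta)\big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P_n\{f \le 0\} < P\{f \le 0\} - \inf_{\delta \in (0,1]}\big[P\{-\delta < f \le 0\} + \Delta_n(\mathcal{F}; \delta)\big] - \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$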

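Combining the last bounds, we get the following result:

Theorem 4. For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : |P_n\{f \le 0\} - P\{f \le 0\}| > \inf_{\delta \in (0,1]}\big[P_n\{|f| \le \delta\} + \Delta_n(\mathcal{F}; \delta)\big] + \frac{t}{\sqrt{n}}\Big\} \le 4\exp\{-2t^2\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : |P_n\{f \le 0\} - P\{f \le 0\}| > \inf_{\delta \in (0,1]}\big[P\{|f| \le \delta\} + \Delta_n(\mathcal{F}; \delta)\big] + \frac{t}{\sqrt{n}}\Big\} \le 4\exp\{-2t^2\}.$$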



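Denote

$$H_f(\delta) := \delta P\{|f| \le \delta\}, \qquad H_{n,f}(\delta) := \delta P_n\{|f| \le \delta\}.$$

Plugging $\delta := H_f^{-1}(R_n(\mathcal{F})) \wedge 1$ into the second bound of Theorem 4 (we use the notation $a \wedge b := \min(a, b)$)
easily gives us the following upper bound that holds for any $t > 0$ with probability at least $1 - 4e^{-2t^2}$:

$$\forall f \in \mathcal{F} \quad |P_n\{f \le 0\} - P\{f \le 0\}| \le \frac{9R_n(\mathcal{F})}{\delta} + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2} + \frac{t}{\sqrt{n}}.$$

Similarly, the first bound of Theorem 4 gives that for any $t > 0$ with probability at least $1 - 4e^{-2t^2}$

$$\forall f \in \mathcal{F} \quad |P_n\{f \le 0\} - P\{f \le 0\}| \le \frac{9R_n(\mathcal{F})}{\delta} + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2} + \frac{t}{\sqrt{n}}$$

with $\delta := H_{n,f}^{-1}(R_n(\mathcal{F})) \wedge 1$.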

The next example shows that, in general, the term $\frac{1}{\delta}R_n(\mathcal{F})$ of the bound of Theorem 2 (and other
similar results, in particular, Theorem 4) can not be improved.

Let us consider a sequence $\{X_n\}$ of independent identically distributed random variables in $\ell^{\infty}$ defined
by

$$X_n = \big\{\varepsilon_n^k\,(2\log(k + 1))^{-1/2}\big\}_{k \ge 1}, \quad n \ge 1,$$

where $\varepsilon_n^k$ are i.i.d. Rademacher random variables ($\mathbb{P}(\varepsilon_n^k = \pm 1) = 1/2$). We consider a class of functions that
consists of canonical projections on each coordinate

$$\mathcal{F} = \{f_k : f_k(x) = x_k\}.$$

Let $\phi(x)$ be an increasing function such that $\phi(0) = 0$. Then the following proposition holds.
Proposition 1.

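$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le 0\} > \inf_{\delta \in (0,1]}\Big[P_n\{f \le \delta\} + \frac{1}{\phi(\delta)}R_n(\mathcal{F})\Big] + \frac{t}{\sqrt{n}}\Big\} \to 1$$

when $n \to \infty$ uniformly for all $t \le 2^{-1}n^{1/2}\phi((4n)^{-1/2}) - c$, where $c > 0$ is some fixed constant.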
Proof. It is well known that $\mathcal{F}$ is a bounded CLT class for the distribution $P$ of the sequence $\{X_n\}$ (see
Ledoux and Talagrand (1991), pp. 276-277). Notice that $P(f_k \le 0) = 1/2$ for all $k$ and $\mathbb{E}\|n^{-1}\sum\varepsilon_i\delta_{X_i}\|_{\mathcal{F}} \le cn^{-1/2}$
for some constant $c > 0$. Let us denote $t' = t + 2\sqrt{2\pi}\,c$. The infimum inside the probability is less
than or equal to the value of the expression at any fixed point. Therefore, for each $k$ we will choose $\delta$ to be
equal to a 8k > (21og(fc + l)) -1 / 2 . It's easy to see that for this value of <5, 



n 

p„{/ fc <4} = -^/(4 = -i 

71 ' ' 



n 

i=l 



Combining these estimates we get that the probability defined in the statement of the proposition is greater 
than or equal to 

r| 3t :i>lE'(4 = -i) + ^Ui-n^<^^ = - 1 » ' 

i<n ) k \ i<n 



In the product above factors are possibly not equal to 1 only for k in the set of indices 

/C = { fc:7 ^0(^-^- 

Clearly, 



»|i/2<n- i x;^i 

^ i<n 



where $k_0 = [n/2 - \delta n] - 1$. For simplicity of calculations we will set $k_0 = n/2 - \delta n$. Utilizing the following
estimates in Stirling's formula for the factorial (see Feller (1950))

$$(2\pi)^{1/2}n^{n + 1/2}e^{-n + 1/(12n + 1)} < n! < (2\pi)^{1/2}n^{n + 1/2}e^{-n + 1/(12n)} \tag{2.10}$$

it is straightforward to check that for some constant $c > 0$

$$\binom{n}{k_0}2^{-n} \ge cn^{-1/2}\big((1 - 2\delta)^{1 - 2\delta}(1 + 2\delta)^{1 + 2\delta}\big)^{-n/2} \ge cn^{-1/2}\exp(-4n\delta^2). \tag{2.11}$$

The last inequality is due to the fact that

$$\exp(x^2) \le (1 - x)^{1 - x}(1 + x)^{1 + x} \le \exp(2x^2)$$

for $x \le 2^{-1/2}$. It follows from (2.11) that


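$$\mathbb{P}\Big(n^{-1}\sum_{i \le n} I(\varepsilon_i^k = -1) + \gamma_k \ge \frac{1}{2}\Big) \le 1 - cn^{-1/2}\exp(-4n\gamma_k^2).$$

Since $\gamma_k \le 1/2$ for $k \in \mathcal{K}$, we can continue and come to the following lower bound

$$1 - \prod_{k \in \mathcal{K}}\big(1 - cn^{-1/2}\exp(-4n\gamma_k^2)\big) \ge 1 - \exp\Big(-\sum_{k \in \mathcal{K}}cn^{-1/2}\exp(-4n\gamma_k^2)\Big)$$

$$\ge 1 - \exp\big(-\mathrm{card}(\mathcal{K})\,cn^{-1/2}e^{-n}\big) \to 1,$$

uniformly in $t'$, if we check that $\mathrm{card}(\mathcal{K})\,cn^{-1/2}e^{-n} \to \infty$. Indeed, if

$$t' \le 2^{-1}n^{1/2}\phi((4n)^{-1/2})$$

then for $n$ large enough

$$t' \le 2^{-1}n^{1/2}\phi((4n)^{-1/2}) \le 2^{-1}n^{1/2}\phi\big((2\log([cne^{n}] + 1))^{-1/2}\big).$$

It means that $[cne^{n}] \in \mathcal{K}$, and, therefore,

$$\mathrm{card}(\mathcal{K})\,cn^{-1/2}e^{-n} \ge cn^{-1/2}e^{-n}\,[cne^{n}] \to \infty.$$

Proposition is proven.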
Remarks. If $\phi(x) = x^{1-\alpha}$ for some positive $\alpha$ then the convergence in the proposition holds for
$t \le cn^{\alpha/2}$. Also, if $\frac{\phi(\delta)}{\delta} \to \infty$ as $\delta \to 0$, then the convergence in the proposition holds uniformly in $t \in [0, T]$
for any $T > 0$. It means that the bound of Theorem 2 does not hold with $\frac{1}{\delta}R_n(\mathcal{F})$ replaced by $\frac{1}{\phi(\delta)}R_n(\mathcal{F})$.
Similarly, one can show that

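$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : |P_n\{f \le 0\} - P\{f \le 0\}| > \inf_{\delta \in (0,1]}\Big[P_n\{|f| \le \delta\} + \frac{1}{\phi(\delta)}R_n(\mathcal{F})\Big] + \frac{t}{\sqrt{n}}\Big\} \to 1$$

when $n \to \infty$ uniformly for all $t \le 2^{-1}n^{1/2}\phi((4n)^{-1/2}) - c$.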



3. Conditions on random entropies and $\gamma$-margins. Given a metric space $(T, d)$, we denote by $H_d(T; \varepsilon)$
the $\varepsilon$-entropy of $T$ with respect to $d$, i.e.

$$H_d(T; \varepsilon) := \log N_d(T; \varepsilon),$$

where $N_d(T; \varepsilon)$ is the minimal number of balls of radius $\varepsilon$ covering $T$. Let $d_{P_n, 2}$ denote the metric of the
space $L_2(S; dP_n)$:

$$d_{P_n, 2}(f, g) := (P_n|f - g|^2)^{1/2}.$$

The next theorems improve the bounds of the previous section under some assumptions on the growth of
random entropies $H_{d_{P_n, 2}}(\mathcal{F}; \cdot)$. We will use these results in Section 5 to obtain an improvement of the bound
of Schapire, Freund, Bartlett and Lee (1998) on generalization error of boosting. The method of proof is
similar to the one developed in Koltchinskii and Panchenko (1999) and is based on powerful concentration
inequalities of Talagrand (1996) (see also Massart (2000)).

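Define for $\gamma \in (0, 1]$

$$\delta_n(\gamma; f) := \sup\{\delta \in (0, 1) : \delta^{\gamma}P\{f \le \delta\} \le n^{-1 + \gamma/2}\}$$

and

$$\hat\delta_n(\gamma; f) := \sup\{\delta \in (0, 1) : \delta^{\gamma}P_n\{f \le \delta\} \le n^{-1 + \gamma/2}\}.$$

We call $\delta_n(\gamma; f)$ and $\hat\delta_n(\gamma; f)$, respectively, the $\gamma$-margin and the empirical $\gamma$-margin of $f$.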
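
The empirical $\gamma$-margin can be computed exactly from the sorted sample values, since $P_n\{f \le \delta\}$ is piecewise constant in $\delta$. The sketch below is one possible way to do it, not the authors' implementation; the function names `gamma_margin` and `gamma_bound` are hypothetical, and `gamma_bound` returns the quantity $(n^{1 - \gamma/2}\hat\delta_n(\gamma; f)^{\gamma})^{-1}$ without any multiplicative constant.

```python
import numpy as np

def gamma_margin(f_values, gamma):
    """Empirical gamma-margin: sup{delta in (0,1) : delta^gamma * P_n{f <= delta} <= n^(-1+gamma/2)}."""
    v = np.sort(np.asarray(f_values, dtype=float))
    n = v.size
    tau = n ** (-1.0 + gamma / 2.0)
    edges = np.concatenate(([-np.inf], v, [np.inf]))
    best = 0.0
    for c in range(n + 1):
        # On [edges[c], edges[c+1]) exactly c sample values are <= delta.
        lo = max(edges[c], 0.0)
        hi = min(edges[c + 1], 1.0)
        if hi <= lo:
            continue  # this piece does not meet (0, 1)
        # Largest delta allowed on this piece: delta^gamma * c / n <= tau.
        cap = np.inf if c == 0 else (tau * n / c) ** (1.0 / gamma)
        if cap >= lo:
            best = max(best, min(hi, cap))
    return float(best)

def gamma_bound(f_values, gamma):
    """The quantity 1 / (n^(1 - gamma/2) * empirical gamma-margin^gamma)."""
    n = len(f_values)
    return 1.0 / (n ** (1.0 - gamma / 2.0) * gamma_margin(f_values, gamma) ** gamma)
```

For instance, if all four sample values equal 1 the empirical 1-margin is 1, while if all four equal $-1$ it is $n^{-1/2} = 0.5$ and the corresponding bound equals 1.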

The main result of this section is Theorem 5 that gives the condition on the random entropy $H_{d_{P_n, 2}}(\mathcal{F}; \cdot)$
under which the true $\gamma$-margin of any $f \in \mathcal{F}$ is with probability very close to 1 within a multiplicative constant
from its empirical $\gamma$-margin. This implies that with high probability for all $f \in \mathcal{F}$

$$P\{f \le 0\} \le \frac{C_\gamma}{n^{1 - \gamma/2}\,\hat\delta_n(\gamma; f)^{\gamma}}.$$

The bounds of the previous section correspond to the case of $\gamma = 1$. It is easy to see from the definitions
of $\gamma$-margins that the quantity $(n^{1 - \gamma/2}\hat\delta_n(\gamma; f)^{\gamma})^{-1}$ (called in the introduction the $\gamma$-bound) increases in
$\gamma \in (0, 1]$. This shows that the bound in the case of $\gamma < 1$ is tighter than the bounds of Section 2.

Theorem 5. Suppose that for some $\alpha \in (0, 2)$ and for some constant $D > 0$

$$H_{d_{P_n, 2}}(\mathcal{F}; u) \le Du^{-\alpha}, \quad u > 0 \quad \text{a.s.} \tag{3.1}$$

Then for any $\gamma \ge \frac{2\alpha}{2 + \alpha}$, for some constants $A, B > 0$ and for all large enough $n$

$$\mathbb{P}\big\{\forall f \in \mathcal{F} : A^{-1}\hat\delta_n(\gamma; f) \le \delta_n(\gamma; f) \le A\hat\delta_n(\gamma; f)\big\} \ge 1 - B\log_2\log_2 n\,\exp\{-n^{\gamma/2}/2\}.$$
The proof is based on the following result. 
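Theorem 6. Suppose that for some $\alpha \in (0, 2)$ and for some constant $D > 0$ condition (3.1) holds. Then
for some constants $A, B > 0$, for all $\delta > 0$ and

$$\varepsilon \ge \Big(\frac{1}{n\delta^{\alpha}}\Big)^{\frac{2}{2 + \alpha}} \vee \frac{2\log n}{n},$$

and for all large enough $n$, the following bounds hold:

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P_n\{f \le \delta\} \le \varepsilon \text{ and } P\Big\{f \le \frac{\delta}{2}\Big\} \ge A\varepsilon\Big\} \le B\log_2\log_2\varepsilon^{-1}\exp\Big\{-\frac{n\varepsilon}{2}\Big\}$$

and

$$\mathbb{P}\Big\{\exists f \in \mathcal{F} : P\{f \le \delta\} \le \varepsilon \text{ and } P_n\Big\{f \le \frac{\delta}{2}\Big\} \ge A\varepsilon\Big\} \le B\log_2\log_2\varepsilon^{-1}\exp\Big\{-\frac{n\varepsilon}{2}\Big\}.$$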
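Proof. Define recursively

$$r_0 := 1, \qquad r_{k+1} = C\sqrt{r_k\varepsilon} \wedge 1$$

with some sufficiently large constant $C > 1$ (the choice of $C$ will be explained later). By a simple induction
argument we have either $C\sqrt{\varepsilon} \ge 1$ and $r_k \equiv 1$, or $C\sqrt{\varepsilon} < 1$ and in this case

$$r_k = C^{1 + 2^{-1} + \dots + 2^{-(k-1)}}\,\varepsilon^{2^{-1} + \dots + 2^{-k}} = C^{2(1 - 2^{-k})}\,\varepsilon^{1 - 2^{-k}} = (C\sqrt{\varepsilon})^{2(1 - 2^{-k})}.$$

Without loss of generality we can assume that $C\sqrt{\varepsilon} < 1$. Let

$$\gamma_k := \sqrt{\frac{\varepsilon}{r_k}} = C^{2^{-k} - 1}\,\varepsilon^{2^{-k-1}}.$$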
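
The closed form for $r_k$ and the size of the increments $\gamma_k$ can be checked numerically. The sketch below iterates the recursion and compares it with $(C\sqrt{\varepsilon})^{2(1 - 2^{-k})}$; the particular values of $C$ and $\varepsilon$ used with it are hypothetical and chosen only so that $C\sqrt{\varepsilon} < 1$.

```python
import math

def r_sequence(C, eps, k_max):
    """Iterate r_0 = 1, r_{k+1} = min(C * sqrt(r_k * eps), 1) for k < k_max."""
    r = [1.0]
    for _ in range(k_max):
        r.append(min(C * math.sqrt(r[-1] * eps), 1.0))
    return r

def r_closed_form(C, eps, k):
    """Closed form (C sqrt(eps))^(2 (1 - 2^{-k})), valid when C sqrt(eps) < 1."""
    return (C * math.sqrt(eps)) ** (2.0 * (1.0 - 2.0 ** (-k)))

def gammas(C, eps, k_max):
    """gamma_k = sqrt(eps / r_k) for k = 0..k_max."""
    return [math.sqrt(eps / rk) for rk in r_sequence(C, eps, k_max)]
```

With, say, $C = 11$ and $\varepsilon = 10^{-5}$ (so that $\varepsilon \le C^{-4}$) and $k \le 4 \le \log_2\log_2\varepsilon^{-1}$, the recursion matches the closed form and the sum of the $\gamma_k$ stays below $1/2$.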

For a fixed S > 0, define 

4 = <5, 4 := 5(1 - 70 - ... - 7k- 1), 6 k t = ~(4 + 4+i), k > 1. 

' 2 z 

Warning. In what follows in the proof "c" denotes a constant; its values can be different in different 
places. 

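Define $\mathcal F_0:=\mathcal F$, and further recursively

$$\mathcal F_{k+1}:=\bigl\{f\in\mathcal F_k:\ P\{f\le\delta_{k,\frac12}\}\le r_{k+1}/2\bigr\}.$$

For $k\ge 0$, let $\varphi_k$ be a continuous function from $\mathbb R$ into $[0,1]$ such that $\varphi_k(u)=1$ for $u\le\delta_{k,\frac12}$, $\varphi_k(u)=0$ for $u\ge\delta_k$, and linear for $\delta_{k,\frac12}\le u\le\delta_k$. For $k\ge 1$ let $\varphi'_k$ be a continuous function from $\mathbb R$ into $[0,1]$ such that $\varphi'_k(u)=1$ for $u\le\delta_k$, $\varphi'_k(u)=0$ for $u\ge\delta_{k-1,\frac12}$ and linear for $\delta_k\le u\le\delta_{k-1,\frac12}$. We have

$$\sum_{i=0}^{k}\gamma_i=C^{-1}\bigl[C\sqrt{\varepsilon}+(C\sqrt{\varepsilon})^{2^{-1}}+\dots+(C\sqrt{\varepsilon})^{2^{-k}}\bigr]\ \le\ C^{-1}(C\sqrt{\varepsilon})^{2^{-k}}\bigl(1-(C\sqrt{\varepsilon})^{2^{-k}}\bigr)^{-1}\ \le\ 1/2,$$

for $\varepsilon\le C^{-4}$, $C>2(2^{1/4}-1)^{-1}$ and $k\le\log_2\log_2\varepsilon^{-1}$. Hence, for small enough $\varepsilon$ (note that our choice of $\varepsilon\le C^{-4}$ implies $C\sqrt{\varepsilon}<1$), we have

$$\gamma_0+\dots+\gamma_k\le\tfrac12,\quad k\ge 1.$$

Therefore, for all $k\ge 1$, we get $\delta_k\in(\delta/2,\delta)$. Note also that below our choice of $k$ will be such that the restriction $k\le\log_2\log_2\varepsilon^{-1}$ for any fixed $\varepsilon>0$ will always be fulfilled.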
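The recursion for $r_k$, its closed form, and the bound $\gamma_0+\dots+\gamma_k\le 1/2$ can be checked numerically. The following is only an illustrative sketch: the particular values of `C`, `eps` and `delta` are assumptions chosen to satisfy $C>2(2^{1/4}-1)^{-1}$ and $\varepsilon\le C^{-4}$, and the function names are hypothetical.

```python
import math

def r_sequence(eps, C, k_max):
    """Iterate r_0 = 1, r_{k+1} = min(C * sqrt(r_k * eps), 1)."""
    r = [1.0]
    for _ in range(k_max):
        r.append(min(C * math.sqrt(r[-1] * eps), 1.0))
    return r

def r_closed_form(eps, C, k):
    """Closed form (C*sqrt(eps))**(2*(1 - 2**-k)), valid when C*sqrt(eps) < 1."""
    return (C * math.sqrt(eps)) ** (2.0 * (1.0 - 2.0 ** (-k)))

# Illustrative values only: C slightly above 2/(2**(1/4) - 1), eps below C**-4.
C = 2.0 / (2.0 ** 0.25 - 1.0) + 1.0
eps = 0.5 * C ** (-4)
# Largest integer k with k <= log2(log2(1/eps)).
k_max = int(math.log2(math.log2(1.0 / eps)))

r = r_sequence(eps, C, k_max)
gamma = [math.sqrt(eps / r_k) for r_k in r]            # gamma_k = sqrt(eps / r_k)
partial_sums = [sum(gamma[: k + 1]) for k in range(k_max + 1)]
delta = 1.0
deltas = [delta * (1.0 - sum(gamma[:k])) for k in range(k_max + 1)]  # delta_k
```

With these values the recursion and the closed form agree, the sequence $r_k$ decreases, and every $\delta_k$ stays in $(\delta/2,\delta]$, as the argument above requires.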
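Define

$$\mathcal G_k:=\{\varphi_k\circ f:\ f\in\mathcal F_k\},\quad k\ge 0$$

and

$$\mathcal G'_k:=\{\varphi'_k\circ f:\ f\in\mathcal F_k\},\quad k\ge 1.$$

Clearly, by these definitions, for $k\ge 1$

$$\sup_{g\in\mathcal G_k}Pg^2\le\sup_{f\in\mathcal F_k}P\{f\le\delta_k\}\le\sup_{f\in\mathcal F_k}P\{f\le\delta_{k-1,\frac12}\}\le r_k/2\le r_k$$

and

$$\sup_{g\in\mathcal G'_k}Pg^2\le\sup_{f\in\mathcal F_k}P\{f\le\delta_{k-1,\frac12}\}\le r_k/2\le r_k.$$

Since $r_0=1$, for $k=0$ the first inequality becomes trivial. If now we introduce the following events

$$E^{(k)}:=\Bigl\{\|P_n-P\|_{\mathcal G_{k-1}}\le K_1E\|P_n-P\|_{\mathcal G_{k-1}}+K_2\sqrt{r_{k-1}\varepsilon}+K_3\varepsilon\Bigr\}\cap\Bigl\{\|P_n-P\|_{\mathcal G'_k}\le K_1E\|P_n-P\|_{\mathcal G'_k}+K_2\sqrt{r_k\varepsilon}+K_3\varepsilon\Bigr\},\quad k\ge 1,$$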

then it follows from the concentration inequalities of Talagrand (1996a, b) (see also that with some 
numerical constants $K_1,K_2,K_3>0$

$$P\bigl((E^{(k)})^c\bigr)\le 2e^{-n\varepsilon/2}.$$

Denote $E_0=\Omega$,

$$E_N:=\bigcap_{k=1}^{N}E^{(k)},\quad N\ge 1.$$

Then

$$P(E_N^c)\le 2Ne^{-n\varepsilon/2}.$$

In what follows we can and do assume without loss of generality that $\varepsilon<C^{-4}$ and therefore, $r_{k+1}<r_k$ and $\delta_k\in(\delta/2,\delta]$, $k\le\log_2\log_2\varepsilon^{-1}$. (If $\varepsilon\ge C^{-4}$, then the bounds of the theorem obviously hold with any constant $A>C^4$.) The following lemma holds.

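Lemma 1. Let $N$ be such that

(3.3) $$N\le\log_2\log_2\varepsilon^{-1}\quad\text{and}\quad r_N\ge\varepsilon.$$

Let $J:=\bigl\{\inf_{f\in\mathcal F}P_n\{f\le\delta\}\le\varepsilon\bigr\}$. Then the following properties hold on the event $E_N\cap J$:

(i) $\forall f\in\mathcal F:\ P_n\{f\le\delta\}\le\varepsilon\ \Longrightarrow\ f\in\mathcal F_N$

and

(ii) $\sup_{f\in\mathcal F_k}P_n\{f\le\delta_k\}\le r_k,\quad 0\le k\le N.$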

Proof. We will use induction with respect to $N$. For $N=0$, the statement is obvious. Suppose it holds for some $N\ge 0$ such that $N+1$ still satisfies condition (3.3) of the lemma. Then on the event $E_N\cap J$ we have

$$\sup_{f\in\mathcal F_k}P_n\{f\le\delta_k\}\le r_k,\quad 0\le k\le N$$

and

$$\forall f\in\mathcal F:\ P_n\{f\le\delta\}\le\varepsilon\ \Longrightarrow\ f\in\mathcal F_N.$$

Suppose now that $f\in\mathcal F$ is such that $P_n\{f\le\delta\}\le\varepsilon$. By the induction assumptions, on the event $E_N$, we have $f\in\mathcal F_N$. Because of this, we obtain on the event $E_{N+1}$

$$P\{f\le\delta_{N,\frac12}\}\le P_n\{f\le\delta_N\}+\|P_n-P\|_{\mathcal G_N}$$

(3.4) $$\le\varepsilon+K_1E\|P_n-P\|_{\mathcal G_N}+K_2\sqrt{r_N\varepsilon}+K_3\varepsilon.$$
For a class $\mathcal G$, define

$$\hat R_n(\mathcal G):=\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal G},$$

where $\{\varepsilon_i\}$ is a sequence of i.i.d. Rademacher random variables. By the symmetrization inequality,

(3.5) $$E\|P_n-P\|_{\mathcal G_N}\le 2E\,I_{E_N}E_\varepsilon\hat R_n(\mathcal G_N)+2E\,I_{E_N^c}E_\varepsilon\hat R_n(\mathcal G_N).$$
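For intuition, the conditional Rademacher average $E_\varepsilon\hat R_n(\mathcal G)$ of a finite class can be approximated by Monte Carlo over the signs $\varepsilon_i$. This is only a sketch under assumptions not made in the text: the class is finite and is represented by a table `values[g][i]` $=g(X_i)$, and the function name and defaults are hypothetical.

```python
import random

def rademacher_average(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_g |n^{-1} sum_i eps_i g(X_i)|.

    values[g][i] holds g(X_i) for each function g of a finite class.
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        # One draw of i.i.d. Rademacher signs.
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the (finite) class of the absolute signed average.
        total += max(abs(sum(e * v for e, v in zip(eps, row))) / n
                     for row in values)
    return total / num_draws
```

For example, `rademacher_average([[1, 0, 1, 0], [0, 1, 0, 1]])` estimates the average for two indicator functions on four sample points.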

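Next, by the well known entropy inequalities for subgaussian processes (see van der Vaart and Wellner (1996), Corollary 2.2.8), we have

(3.6) $$E_\varepsilon\hat R_n(\mathcal G_N)\le\inf_{g\in\mathcal G_N}E_\varepsilon\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|+\frac{c}{\sqrt n}\int_0^{(2\sup_{g\in\mathcal G_N}P_ng^2)^{1/2}}H^{1/2}_{d_{P_n,2}}(\mathcal G_N;u)\,du.$$

By the induction assumption, on the event $E_N\cap J$

$$\inf_{g\in\mathcal G_N}E_\varepsilon\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|\le\inf_{g\in\mathcal G_N}E_\varepsilon^{1/2}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|^2\le\frac{1}{\sqrt n}\inf_{g\in\mathcal G_N}\sqrt{P_ng^2}$$

$$\le\frac{1}{\sqrt n}\inf_{f\in\mathcal F_N}\sqrt{P_n\{f\le\delta_N\}}\le\frac{1}{\sqrt n}\inf_{f\in\mathcal F}\sqrt{P_n\{f\le\delta\}}\le\sqrt{\frac{\varepsilon}{n}}\le\varepsilon.$$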



We also have on the event $E_N\cap J$

$$\sup_{g\in\mathcal G_N}P_ng^2\le\sup_{f\in\mathcal F_N}P_n\{f\le\delta_N\}\le r_N.$$



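The Lipschitz norm of $\varphi_{k-1}$ and $\varphi'_k$ is bounded by

$$L=2(\delta_{k-1}-\delta_k)^{-1}=2\delta^{-1}\gamma_{k-1}^{-1}=\frac{2}{\delta}\sqrt{\frac{r_{k-1}}{\varepsilon}},$$

which implies the following bound on the distance

$$d^2_{P_n,2}(\varphi_N\circ f;\varphi_N\circ g)=n^{-1}\sum_{j=1}^{n}\bigl|\varphi_N(f(X_j))-\varphi_N(g(X_j))\bigr|^2\le\Bigl(\frac{2}{\delta}\sqrt{\frac{r_N}{\varepsilon}}\Bigr)^2d^2_{P_n,2}(f,g).$$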



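Therefore, on the event $E_N\cap J$

$$\frac{1}{\sqrt n}\int_0^{(2\sup_{g\in\mathcal G_N}P_ng^2)^{1/2}}H^{1/2}_{d_{P_n,2}}(\mathcal G_N;u)\,du\le\frac{1}{\sqrt n}\int_0^{(2r_N)^{1/2}}H^{1/2}_{d_{P_n,2}}\Bigl(\mathcal F;\frac{\delta}{2}\sqrt{\frac{\varepsilon}{r_N}}\,u\Bigr)du$$

(3.7) $$\le\frac{c}{n^{1/2}\delta^{\alpha/2}}\Bigl(\frac{r_N}{\varepsilon}\Bigr)^{\alpha/4}r_N^{1/2-\alpha/4}\le c\,(r_N\varepsilon)^{1/2},$$

where we used the fact that condition (3.2) of the theorem implies

$$\frac{1}{n^{1/2}\delta^{\alpha/2}}\le\varepsilon^{\frac{2+\alpha}{4}}.$$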

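It follows from (3.6), (3.7) that on the event $E_{N+1}\cap J$

(3.8) $$E_\varepsilon\hat R_n(\mathcal G_N)\le c\sqrt{r_N\varepsilon}.$$

Since we also have

$$E_\varepsilon\hat R_n(\mathcal G_N)\le 1,$$

(3.5) and (3.8) yield

$$E\|P_n-P\|_{\mathcal G_N}\le c\sqrt{r_N\varepsilon}+2P(E_N^c)\le c\sqrt{r_N\varepsilon}+4Ne^{-n\varepsilon/2}.$$

Since $4Ne^{-n\varepsilon/2}\le\varepsilon$ (this holds, due to conditions (3.2) and (3.3), for all large enough $n$) we conclude that with some constant $c>0$

$$E\|P_n-P\|_{\mathcal G_N}\le c\sqrt{r_N\varepsilon}.$$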
Now we use (3.4) and see that on the event $E_{N+1}\cap J$

(3.9) $$P\{f\le\delta_{N,\frac12}\}\le c\bigl(\varepsilon+\sqrt{r_N\varepsilon}\bigr).$$

Therefore, it follows that with a proper choice of the constant $C>0$ in the recurrence relationship defining the sequence $\{r_k\}$, we have on the event $E_{N+1}\cap J$

$$P\{f\le\delta_{N,\frac12}\}\le\tfrac12C\sqrt{r_N\varepsilon}=r_{N+1}/2.$$

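This means that $f\in\mathcal F_{N+1}$ and the induction step for (i) is proved. This will now imply (ii). We have on the event $E_{N+1}$

$$\sup_{f\in\mathcal F_{N+1}}P_n\{f\le\delta_{N+1}\}\le\sup_{f\in\mathcal F_{N+1}}P\{f\le\delta_{N,\frac12}\}+\|P_n-P\|_{\mathcal G'_{N+1}}$$

(3.10) $$\le r_{N+1}/2+K_1E\|P_n-P\|_{\mathcal G'_{N+1}}+K_2\sqrt{r_{N+1}\varepsilon}+K_3\varepsilon.$$

By the symmetrization inequality,

(3.11) $$E\|P_n-P\|_{\mathcal G'_{N+1}}\le 2E\,I_{E_N}E_\varepsilon\hat R_n(\mathcal G'_{N+1})+2E\,I_{E_N^c}E_\varepsilon\hat R_n(\mathcal G'_{N+1}).$$

As above, we have

(3.12) $$E_\varepsilon\hat R_n(\mathcal G'_{N+1})\le\inf_{g\in\mathcal G'_{N+1}}E_\varepsilon\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|+\frac{c}{\sqrt n}\int_0^{(2\sup_{g\in\mathcal G'_{N+1}}P_ng^2)^{1/2}}H^{1/2}_{d_{P_n,2}}(\mathcal G'_{N+1};u)\,du.$$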

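Since we already proved (i), it implies that on the event $E_{N+1}\cap J$

$$\inf_{g\in\mathcal G'_{N+1}}E_\varepsilon\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|\le\inf_{g\in\mathcal G'_{N+1}}E_\varepsilon^{1/2}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jg(X_j)\Bigr|^2\le\frac{1}{\sqrt n}\inf_{g\in\mathcal G'_{N+1}}\sqrt{P_ng^2}$$

$$\le\frac{1}{\sqrt n}\inf_{f\in\mathcal F_{N+1}}\sqrt{P_n\{f\le\delta_{N,\frac12}\}}\le\frac{1}{\sqrt n}\inf_{f\in\mathcal F}\sqrt{P_n\{f\le\delta\}}\le\sqrt{\frac{\varepsilon}{n}}\le\varepsilon.$$

By the induction assumption, we also have on the event $E_{N+1}\cap J$

$$\sup_{g\in\mathcal G'_{N+1}}P_ng^2\le\sup_{f\in\mathcal F_N}P_n\{f\le\delta_{N,\frac12}\}\le r_N.$$

The bound for the Lipschitz norm of $\varphi'_k$ gives the following bound on the distance

$$d_{P_n,2}(\varphi'_{N+1}\circ f;\varphi'_{N+1}\circ g)=\Bigl(n^{-1}\sum_{j=1}^{n}\bigl|\varphi'_{N+1}\circ f(X_j)-\varphi'_{N+1}\circ g(X_j)\bigr|^2\Bigr)^{1/2}\le\frac{2}{\delta}\sqrt{\frac{r_N}{\varepsilon}}\,d_{P_n,2}(f,g).$$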
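Therefore, on the event $E_{N+1}\cap J$, we get quite similarly to (3.7)

(3.13) $$\frac{1}{\sqrt n}\int_0^{(2\sup_{g\in\mathcal G'_{N+1}}P_ng^2)^{1/2}}H^{1/2}_{d_{P_n,2}}(\mathcal G'_{N+1};u)\,du\le\frac{1}{\sqrt n}\int_0^{(2r_N)^{1/2}}H^{1/2}_{d_{P_n,2}}\Bigl(\mathcal F;\frac{\delta}{2}\sqrt{\frac{\varepsilon}{r_N}}\,u\Bigr)du\le c\sqrt{r_N\varepsilon}.$$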

We collect all bounds to see that on the event $E_{N+1}\cap J$

(3.14) $$\sup_{f\in\mathcal F_{N+1}}P_n\{f\le\delta_{N+1}\}\le\frac{r_{N+1}}{2}+c\sqrt{r_N\varepsilon}.$$

Therefore, it follows that with a proper choice of the constant $C>0$ in the recurrence relationship defining the sequence $\{r_k\}$, we have on the event $E_{N+1}\cap J$

$$\sup_{f\in\mathcal F_{N+1}}P_n\{f\le\delta_{N+1}\}\le C\sqrt{r_N\varepsilon}=r_{N+1},$$

which proves the induction step for (ii) and, therefore, the lemma is proved.

To complete the proof of the theorem, we have to note that the choice of $N=[\log_2\log_2\varepsilon^{-1}]$ implies that $r_{N+1}\le c\varepsilon$ for some $c>0$. The second inequality of the theorem can be proved similarly with some minor modifications.

□

Proof of Theorem 5. Consider sequences Sj := 2~ 3 ~ , 

1 2 

where a' := > a. The first inequality of Theorem 6 implies 

P{3j > 3/ P n {f < Sj} < e 3 and P{f < 8^2} > A'ej} < 

(3.15) < S'log 2 log 2 n2jexp{ — 2 2 ^} < B log 2 log 2 n exp{ — } 

with some B, B', A' > 0. If for some j > 1, we have 

4(7;/) g 

then by definition of <5 n (7; /) 

Pn{/ < a,-} < 

Suppose that for some $f\in\mathcal F$ the inequality $A^{-1}\hat\delta_n(\gamma;f)\le\delta_n(\gamma;f)$ fails. Then, it follows from the definition of $\delta_n(\gamma;f)$ that

P{f < 5,12} > P{f < ^i} > (-±-)»&A*& > A'ej, 

where the last inequality holds for the proper choice of a constant $A$. Hence, (3.15) guarantees the probability bound for the left side inequality of the theorem. The right side inequality is proved similarly utilizing the second inequality of Theorem 6.



□ 



4. Convergence rates of empirical margin distributions. As we defined in Section 2, $\mathcal F$ is a class of measurable functions from $S$ into $\mathbb R$. For $f\in\mathcal F$, let

$$F_f(y):=P\{f\le y\},\qquad F_{n,f}(y):=P_n\{f\le y\},\quad y\in\mathbb R.$$

Let $L$ denote the Lévy distance between the distribution functions in $\mathbb R$:

$$L(F,G):=\inf\{\delta>0:\ F(t)\le G(t+\delta)+\delta\ \text{and}\ G(t)\le F(t+\delta)+\delta,\ \text{for all } t\in\mathbb R\}.$$
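As a concrete illustration of this definition, the Lévy distance between two empirical distribution functions can be approximated by bisection on $\delta$. The sketch below relies on two facts that are easy to check for step functions: the two inequalities are monotone in $\delta$, and it suffices to test them at the jump points of the function on the left side. The function names are hypothetical.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the sorted sample and its right-continuous empirical CDF."""
    xs = sorted(sample)
    n = len(xs)
    return xs, (lambda t: bisect_right(xs, t) / n)

def levy_distance(sample_f, sample_g, tol=1e-9):
    """Approximate L(F, G) for the empirical CDFs of two samples."""
    xs_f, F = ecdf(sample_f)
    xs_g, G = ecdf(sample_g)

    def admissible(delta):
        # F(t) - G(t + delta) is largest at a jump of F (and symmetrically for G),
        # so checking the atoms of each sample is enough.
        return (all(F(t) <= G(t + delta) + delta for t in xs_f)
                and all(G(t) <= F(t + delta) + delta for t in xs_g))

    lo, hi = 0.0, 1.0            # delta = 1 is always admissible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if admissible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For two unit point masses at $0$ and at $0.3$ the distance is $0.3$; for point masses farther than $1$ apart it equals $1$, the largest possible value.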

In what follows, for a function $f$ from $S$ into $\mathbb R$ and $M>0$, we denote by $f_M$ the function that is equal to $f$ if $|f|\le M$, is equal to $M$ if $f>M$ and is equal to $-M$ if $f<-M$. We set

$$\mathcal F_M:=\{f_M:\ f\in\mathcal F\}.$$

As always, a function $F$ from $S$ into $[0,+\infty)$ is called an envelope of $\mathcal F$ iff $|f(x)|\le F(x)$ for all $f\in\mathcal F$ and all $x\in S$.

We write $\mathcal F\in GC(P)$ iff $\mathcal F$ is a Glivenko-Cantelli class with respect to $P$ (i.e. $\|P_n-P\|_{\mathcal F}\to 0$ as $n\to\infty$ a.s.). We write $\mathcal F\in BCLT(P)$ and say that $\mathcal F$ satisfies the Bounded Central Limit Theorem for $P$ iff

$$E\|P_n-P\|_{\mathcal F}=O(n^{-1/2}).$$

In particular, this holds if $\mathcal F$ is a $P$-Donsker class (see Dudley (1999), van der Vaart and Wellner (1996) for precise definitions).
Our main goal in this section is to prove the following results. 
Theorem 7. Suppose that

(4.1) $$\sup_{f\in\mathcal F}P\{|f|\ge M\}\to 0\quad\text{as } M\to\infty.$$

Then, the following two statements are equivalent:

(i) $\mathcal F_M\in GC(P)$ for all $M>0$

and

(ii) $\sup_{f\in\mathcal F}L(F_{n,f},F_f)\to 0$ a.s. as $n\to\infty$.

Theorem 8. The following two statements are equivalent:

(i) $\mathcal F\in GC(P)$

(ii) there exists a $P$-integrable envelope for the class $\mathcal F^{(c)}=\{f-Pf:\ f\in\mathcal F\}$ and

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)\to 0\quad\text{a.s. as } n\to\infty.$$

Theorem 9. Suppose that the class $\mathcal F$ is uniformly bounded. If $\mathcal F\in BCLT(P)$, then

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)=O_P(n^{-1/4})\quad\text{as } n\to\infty.$$

Moreover, if for some $\alpha\in(0,2)$ and for some $D>0$

(4.2) $$H_{d_{P_n,2}}(\mathcal F;u)\le Du^{-\alpha},\quad u>0\ \text{a.s.},$$

then

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)=O\bigl(n^{-\frac{1}{2+\alpha}}\bigr)\quad\text{as } n\to\infty\ \text{a.s.}$$
The following theorem gives the bound that plays an important role in the proofs. 

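Theorem 10. Let $M>0$ and let $\mathcal F$ be a class of measurable functions from $S$ into $[-M,M]$. For all $t>0$,

$$P\Bigl\{\sup_{f\in\mathcal F}L(F_{n,f},F_f)\ge 2\Bigl(E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}+\frac{M}{\sqrt n}\Bigr)^{1/2}+\frac{t}{\sqrt n}\Bigr\}\le\exp\{-2t^2\}.$$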

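Proof. Let $\delta>0$. Let $\varphi(x)$ be equal to 1 for $x\le 0$, 0 for $x\ge 1$ and linear in between. One can get the following bounds:

$$F_f(y)=P\{f\le y\}\le P\varphi\Bigl(\frac{f-y}{\delta}\Bigr)\le P_n\varphi\Bigl(\frac{f-y}{\delta}\Bigr)+\|P_n-P\|_{\mathcal G_\delta}\le F_{n,f}(y+\delta)+\|P_n-P\|_{\mathcal G_\delta}$$

and

$$F_{n,f}(y)=P_n\{f\le y\}\le P_n\varphi\Bigl(\frac{f-y}{\delta}\Bigr)\le P\varphi\Bigl(\frac{f-y}{\delta}\Bigr)+\|P_n-P\|_{\mathcal G_\delta}\le F_f(y+\delta)+\|P_n-P\|_{\mathcal G_\delta},$$

where

$$\mathcal G_\delta:=\Bigl\{\varphi\Bigl(\frac{f-y}{\delta}\Bigr)-1:\ f\in\mathcal F,\ y\in[-M,M]\Bigr\}.$$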
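Similarly to the proof of Theorem 1 we get that with probability at least $1-e^{-2t^2}$

(4.3) $$\|P_n-P\|_{\mathcal G_\delta}\le\frac{4}{\delta}\Bigl[E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}+Mn^{-1/2}\Bigr]+\frac{t}{\sqrt n}.$$

Setting

$$\delta:=2\Bigl(E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}+Mn^{-1/2}\Bigr)^{1/2}$$

we get that with probability at least $1-\exp\{-2t^2\}$

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)\le 2\Bigl(E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}+Mn^{-1/2}\Bigr)^{1/2}+\frac{t}{\sqrt n},$$

which completes the proof.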



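Proof of Theorem 7. First we prove that (i) implies (ii). Since $\mathcal F_M\in GC(P)$, we have

$$E\|P_n-P\|_{\mathcal F_M}\to 0\quad\text{as } n\to\infty,$$

which, by the symmetrization inequality, implies

$$E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F_M}\to 0\quad\text{as } n\to\infty.$$

Plugging $t=\log n$ into the bound of Theorem 10 and using the Borel-Cantelli Lemma proves that for all $M>0$

$$\sup_{f\in\mathcal F}L(F_{n,f_M},F_{f_M})=\sup_{f\in\mathcal F_M}L(F_{n,f},F_f)\to 0\quad\text{as } n\to\infty\ \text{a.s.}$$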

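The following bounds easily follow from the definition of the Lévy distance:

$$\sup_{f\in\mathcal F}L(F_f,F_{f_M})\le\sup_{f\in\mathcal F}P\{|f|\ge M\}$$

and

$$\sup_{f\in\mathcal F}L(F_{n,f},F_{n,f_M})\le\sup_{f\in\mathcal F}P_n\{|f|\ge M\}.$$

By condition (4.1) of the theorem,

$$\sup_{f\in\mathcal F}L(F_f,F_{f_M})\to 0\quad\text{as } M\to\infty.$$

To prove that also

$$\lim_{M\to\infty}\limsup_{n\to\infty}\sup_{f\in\mathcal F}L(F_{n,f},F_{n,f_M})=0\quad\text{a.s.},$$

it is enough to show that

(4.4) $$\lim_{M\to\infty}\limsup_{n\to\infty}\sup_{f\in\mathcal F}P_n\{|f|\ge M\}=0\quad\text{a.s.}$$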

To this end, consider the function $\varphi$ from $\mathbb R$ into $[0,1]$ that is equal to 0 for $|u|\le M-1$, is equal to 1 for $|u|>M$ and is linear in between. We have

$$\sup_{f\in\mathcal F}P_n\{|f|\ge M\}=\sup_{f\in\mathcal F_M}P_n\{|f|\ge M\}\le\sup_{f\in\mathcal F_M}P_n\varphi(|f|)$$

(4.5) $$\le\sup_{f\in\mathcal F_M}P\varphi(|f|)+\|P_n-P\|_{\mathcal G}\le\sup_{f\in\mathcal F}P\{|f|\ge M-1\}+\|P_n-P\|_{\mathcal G},$$

where

$$\mathcal G:=\{\varphi\circ|f|:\ f\in\mathcal F_M\}.$$

Since $\varphi$ satisfies the Lipschitz condition with constant 1, the argument based on the symmetrization inequality and comparison inequalities (see the proofs above) allows one to show that condition (i) implies that

$$E\|P_n-P\|_{\mathcal G}\to 0\quad\text{as } n\to\infty.$$

Then, the standard use of the concentration inequality implies that

$$\|P_n-P\|_{\mathcal G}\to 0\quad\text{as } n\to\infty\ \text{a.s.}$$

Therefore, (4.4) immediately follows from condition (4.1) and (4.5). Now, the triangle inequality for the Lévy distance allows one easily to complete the proof of (ii).

To prove that (ii) implies (i), we use the following bound

$$\Bigl|\int_{-M}^{M}t\,d(F-G)(t)\Bigr|\le cL(F,G),$$

which holds with some constant $c=c(M)$ for any two distribution functions on $[-M,M]$. The bound implies that

(4.6) $$\|P_n-P\|_{\mathcal F_M}=\sup_{f\in\mathcal F_M}|P_nf-Pf|=\sup_{f\in\mathcal F_M}\Bigl|\int_{-M}^{M}t\,d(F_{n,f}-F_f)(t)\Bigr|\le c\sup_{f\in\mathcal F_M}L(F_{n,f};F_f).$$

Since for all $M>0$ and for all $f\in\mathcal F$ it is easily proved that

(4.7) $$L(F_{n,f_M},F_{f_M})\le L(F_{n,f},F_f),$$

the bound (4.6) and condition (ii) imply (i), which completes the proof of the second statement.

□ 

Proof of Theorem 8. Since centering changes neither the Lévy distance nor the Glivenko-Cantelli property, we can start by assuming that $\mathcal F$ is centered, i.e. $\mathcal F=\mathcal F^{(c)}$. To prove that (i) implies (ii), note first of all that the condition $\mathcal F\in GC(P)$ yields that $\mathcal F=\mathcal F^{(c)}$ has a $P$-integrable envelope (see van der Vaart and Wellner (1996), p. 125). Also, the existence of a $P$-integrable envelope implies (4.1). Finally, if $\mathcal F\in GC(P)$, then for all $M>0$ $\mathcal F_M\in GC(P)$ [To prove this claim note that $f_M=\varphi_M\circ f$, where $\varphi_M$ is the function from $\mathbb R$ into $[-M,M]$ that is equal to $u$ for $|u|\le M$, is equal to $M$ for $u>M$ and is equal to $-M$ for $u<-M$. The function $\varphi_M$ is Lipschitz with constant 1, which allows one to prove the claim by the argument based on the comparison inequality and used many times above]. We can use Theorem 7 to conclude that (i) implies (ii). On the other hand, if (ii) holds then by the inequality (4.7) we get that

$$\sup_{f\in\mathcal F}L(F_{n,f_M},F_{f_M})\to 0\quad\text{as } n\to\infty\ \text{a.s.}$$

As we pointed out above, (4.1) holds, so, by Theorem 7, we have $\mathcal F_M\in GC(P)$ for all $M>0$. The integrability of the envelope of the class $\mathcal F$ allows us to conclude the proof of (i) by a standard truncation argument.

□ 

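Proof of Theorem 9. Since $\mathcal F$ is uniformly bounded, we can choose $M>0$ such that $\mathcal F_M=\mathcal F$. To prove the first statement note that $\mathcal F\in BCLT(P)$ means that

$$E\|P_n-P\|_{\mathcal F}=O(n^{-1/2}),$$

which implies

$$E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}=O(n^{-1/2}).$$

Thus, the bound of Theorem 10 implies that with some constant $C>0$

$$P\Bigl\{\sup_{f\in\mathcal F}L(F_{n,f},F_f)\ge 2\Bigl(\frac{C}{\sqrt n}+\frac{M}{\sqrt n}\Bigr)^{1/2}+\frac{t}{\sqrt n}\Bigr\}\le\exp\{-2t^2\}.$$

It follows that

$$\lim_{u\to\infty}\limsup_{n\to\infty}P\Bigl\{n^{1/4}\sup_{f\in\mathcal F}L(F_{n,f},F_f)\ge u\Bigr\}=0.$$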
To prove the second statement, we follow the proof of Theorem 10. We use the Rademacher symmetrization inequality to get the bound

$$E\|P_n-P\|_{\mathcal G_\delta}\le 2E\hat R_n(\mathcal G_\delta)$$

and then use the entropy inequalities for subgaussian processes (see van der Vaart and Wellner (1996), Corollary 2.2.8) to show that



MGs) < inf E £ 
g£<3s 



- 1 



i=i 



y/2 sup 



H 



du 



< 



V2 



H 



du. 



To bound the random entropy $H_{d_{P_n,2}}$, we use the Lipschitz condition for the function $\varphi$. It yields (via a standard argument based on constructing minimal coverings of the class $\mathcal F$ with respect to the metric $d_{P_n,2}$ and of the interval $[-M,M]$ with respect to the usual distance on the real line and "combining" the coverings properly) the following bound:

$$H_{d_{P_n,2}}(\mathcal G_\delta;u)\le H_{d_{P_n,2}}(\mathcal F;\delta u/2)+\log\frac{4M}{\delta u}.$$



Therefore, we get (with a proper constant c > 0) 

•V2 



E e Rn(g s ) < 



du 



, 4M 1 
log — + 1 



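which, under the condition (4.2), is bounded from above by $\frac{c}{\sqrt n\,\delta^{\alpha/2}}$. Thus, we proved the bound

$$E\|P_n-P\|_{\mathcal G_\delta}\le\frac{c}{\sqrt n\,\delta^{\alpha/2}}.$$

Arguing now the same way as in the proof of Theorem 10, we can show that with probability at least $1-\exp\{-2t^2\}$,

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)\le\delta\vee\Bigl(\frac{c}{\sqrt n\,\delta^{\alpha/2}}+\frac{t}{\sqrt n}\Bigr).$$

Plugging into the last inequality

$$\delta:=n^{-\frac{1}{2+\alpha}},$$

we get

$$P\Bigl\{\sup_{f\in\mathcal F}L(F_{n,f},F_f)\ge cn^{-\frac{1}{2+\alpha}}+\frac{t}{\sqrt n}\Bigr\}\le\exp\{-2t^2\}.$$

By choosing $t:=\log n$ and using the Borel-Cantelli Lemma, we complete the proof of the second statement.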



Remark. It is interesting to mention that the condition $\mathcal F\in GC(P)$ does not imply that

$$\sup_{f\in\mathcal F}\sup_{t\in\mathbb R}|F_{n,f}(t)-F_f(t)|\to 0\quad\text{as } n\to\infty$$

with probability 1, which is equivalent to saying that the class of sets $\{\{f\le t\}:\ f\in\mathcal F,\ t\in\mathbb R\}$ is $GC(P)$. As an example, consider the case when $S$ is a unit ball in an infinite-dimensional separable Banach space. Let $\mathcal F$ be the restriction of the unit ball in the dual space on $S$. For i.i.d. random variables $\{X_n\}$ in $S$, we have, by the LLN in separable Banach spaces,

$$\|P_n-P\|_{\mathcal F}:=\Bigl\|n^{-1}\sum_{j=1}^{n}(X_j-EX)\Bigr\|\to 0\quad\text{a.s.},$$

so $\mathcal F\in GC(P)$. On the other hand, there exists an example of a distribution $P$ such that $\mathcal H\notin GC(P)$, where $\mathcal H$ is the class of all halfspaces (see Sazonov (1963) and also Topsøe, Dudley and Hoffmann-Jørgensen (1976)). Hence,

$$\sup_{f\in\mathcal F}\sup_{t\in\mathbb R}|F_{n,f}(t)-F_f(t)|=\|P_n-P\|_{\mathcal H}$$

does not converge to 0 a.s.

In the next proposition, we again consider the class $\mathcal F$ used already in Proposition 1 and the sequence of observations $\{X_n\}$ defined by

$$X_n=\bigl\{\varepsilon_k^n(2\log(k+1))^{-\frac12-\beta}\bigr\}_{k\ge 1},\quad n\ge 1,$$

where $\beta:=\frac{1}{\alpha}-\frac12$, $\alpha\in(0,2]$ and $\varepsilon_k^n$ are i.i.d. Rademacher random variables. The proposition shows the optimality of the rates of convergence obtained in Theorem 9.

Proposition 2. Consider the sequence $\delta_n$ such that

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)=O_P(\delta_n).$$

Then

$$\delta_n\ge cn^{-\frac{1}{2+\alpha}}$$

(when $\alpha=2$, we have $\delta_n\ge cn^{-1/4}$). On the other hand, for $\alpha\in(0,2)$, we have

$$H_{d_{P_n,2}}(\mathcal F;u)\le Du^{-\alpha},\quad u>0$$

and

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)=O\bigl(n^{-\frac{1}{2+\alpha}}\bigr)\quad\text{a.s.};$$

for $\alpha=2$ we have $\mathcal F\in BCLT(P)$ and

$$\sup_{f\in\mathcal F}L(F_{n,f},F_f)=O_P(n^{-\frac14}).$$

Proof. We can assume without loss of generality that with probability more than 1/2 for all $k\ge 1$, $y\in[-1,1]$ and $n$ large enough we have

(4.8) $$P(f_k\le y)\le P_n(f_k\le y+\delta)+\delta.$$

If we take $y=0$ and consider only such $k$ that satisfy the inequality $(2\log(k+1))^{\beta+1/2}\le\delta^{-1}$ then (4.8) becomes equivalent to

$$1/2\le n^{-1}\sum_{i\le n}I(\varepsilon_k^i=-1)+\delta.$$

The inequality $(2\log(k+1))^{\beta+1/2}\le\delta^{-1}$ holds for $k\le\psi_1(\delta)=\tfrac12\exp\bigl(\delta^{-\frac{2}{1+2\beta}}/2\bigr)$. Therefore, for large $n$
1/2 < Pi _ 

lk<V>i(<5) i<n 

V>l(<5) / , „ \ 



(4.9) 

where ko = [n/2 — 5n] — 1. Using (|2.11() . we get 



j f) {l/2<n- iy £,I(ei = -l) + 5}\ 

|i/2<^gi ( 4=-i) + ^ 1 <(i-(; o ) 2 -^ 



2 WxWi < l _ cn = exp(-4n<5 2 ) 
Taking the logarithm of both sides and taking into account that $\log(1-x)\le-x$ we get (recall that $\psi_1(\delta)=\tfrac12\exp\bigl(\delta^{-\frac{2}{1+2\beta}}/2\bigr)$)

$$\exp\bigl(-2^{-1}\delta^{-\frac{2}{1+2\beta}}\bigr)\ge cn^{-1/2}\exp(-4n\delta^2).$$

Therefore,

$$\frac{1}{2\delta^{2/(1+2\beta)}}\le 4n\delta^2+c\log n$$

and

$$1/2\le 4n\delta^{\frac{4(1+\beta)}{1+2\beta}}+c\delta^{\frac{2}{1+2\beta}}\log n.$$

This finally implies that

$$\delta\ge cn^{-\frac{1+2\beta}{4(1+\beta)}}=cn^{-\frac{1}{2+\alpha}}.$$

The second statement follows from Theorem 9. To check condition (4.2), note that in this case, as soon as $2\log N\ge(u/2)^{-\alpha}$, we have $|f_k(X_n)|\le u/2$ for all $k\ge N$ and $n\ge 1$. Hence,

$$d_{P_n,2}(f_k,f_N)\le u,\quad k\ge N$$

and we have

$$H_{d_{P_n,2}}(\mathcal F;u)\le\log N,$$

which implies (4.2). For $\alpha=2$, we also have $\mathcal F\in BCLT(P)$ (see Ledoux and Talagrand (1991), pp. 276-277). Theorem 9 allows one to complete the proof.


5. Bounding the generalization error of convex combinations of classifiers. In this and in the next section we consider applications of the bounds of Section 2 to various learning (classification) problems. We start with an application of the inequalities of Section 2 to bounding the generalization error in general multiclass problems. Namely, we assume that the labels take values in a finite set $\mathcal Y$ with $\mathrm{card}(\mathcal Y)=M$. Consider a class $\tilde{\mathcal F}$ of functions from $\tilde S:=S\times\mathcal Y$ into $\mathbb R$. A function $f\in\tilde{\mathcal F}$ predicts a label $y\in\mathcal Y$ for an example $x\in S$ iff

$$f(x,y)>\max_{y'\ne y}f(x,y').$$

The margin of a labeled example $(x,y)$ is defined as

$$m_f(x,y):=f(x,y)-\max_{y'\ne y}f(x,y'),$$

so $f$ misclassifies the labeled example $(x,y)$ iff $m_f(x,y)\le 0$. Let

$$\mathcal F:=\{f(\cdot,y):\ y\in\mathcal Y,\ f\in\tilde{\mathcal F}\}.$$
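The multiclass margin and its empirical distribution are simple to compute once the scores $f(x,y)$ are available. The sketch below assumes, purely for illustration, that the scores for one example are stored in a dictionary keyed by label; the function names are hypothetical.

```python
def margin(scores, y):
    """m_f(x, y) = f(x, y) - max over y' != y of f(x, y').

    scores maps each label to f(x, label) for a single example x.
    """
    return scores[y] - max(s for label, s in scores.items() if label != y)

def empirical_margin_cdf(margins, delta):
    """P_n{m_f <= delta} for a list of observed margins."""
    return sum(m <= delta for m in margins) / len(margins)
```

An example is misclassified exactly when its margin is nonpositive, so `empirical_margin_cdf(margins, 0.0)` is the training error.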

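The proof of the next result is based on the application of Theorem 2.

Theorem 11. For all $t>0$,

$$P\Bigl\{\exists f\in\tilde{\mathcal F}:\ P\{m_f\le 0\}>\inf_{\delta\in(0,1]}\Bigl[P_n\{m_f\le\delta\}+\frac{8M(2M-1)}{\delta}R_n(\mathcal F)+\Bigl(\frac{\log\log_2(2\delta^{-1})}{n}\Bigr)^{1/2}\Bigr]+\frac{t}{\sqrt n}\Bigr\}\le 2\exp\{-2t^2\}.$$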

To prove the theorem, we use the following lemma. For a class of functions $\mathcal H$, we will denote by

$$\mathcal H^{(l)}=\{\max(h_1,\dots,h_l):\ h_1,\dots,h_l\in\mathcal H\}.$$

Lemma 2. The following bound holds:

$$E\Bigl\|\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal H^{(l)}}\le 2l\,E\Bigl\|\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal H}.$$
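Proof. Let $x^+:=x\vee 0$. Obviously $x\mapsto x^+$ is a nondecreasing convex function such that $(a+b)^+\le a^++b^+$. We will first prove that

(5.1) $$E\Bigl(\sup_{h\in\mathcal H^{(l)}}\sum_{i=1}^{n}\varepsilon_ih(X_i)\Bigr)^+\le l\,E\Bigl(\sup_{h\in\mathcal H}\sum_{i=1}^{n}\varepsilon_ih(X_i)\Bigr)^+.$$

Let us consider classes of functions $\mathcal F_1$, $\mathcal F_2$ and

$$\mathcal F=\{\max(f_1,f_2):\ f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}.$$

Since

$$\max(f_1,f_2)=\tfrac12\bigl((f_1+f_2)+|f_1-f_2|\bigr),$$

we have

$$E\Bigl(\sup_{\mathcal F}\sum_{i=1}^{n}\varepsilon_if(X_i)\Bigr)^+\le\tfrac12E\Bigl(\sup_{\mathcal F_1,\mathcal F_2}\sum_{i=1}^{n}\varepsilon_i\bigl(f_1(X_i)+f_2(X_i)\bigr)\Bigr)^++\tfrac12E\Bigl(\sup_{\mathcal F_1,\mathcal F_2}\sum_{i=1}^{n}\varepsilon_i|f_1(X_i)-f_2(X_i)|\Bigr)^+$$

$$\le\tfrac12E\Bigl(\sup_{\mathcal F_1}\sum_{i=1}^{n}\varepsilon_if_1(X_i)\Bigr)^++\tfrac12E\Bigl(\sup_{\mathcal F_2}\sum_{i=1}^{n}\varepsilon_if_2(X_i)\Bigr)^++\tfrac12E\Bigl(\sup_{\mathcal F_1,\mathcal F_2}\sum_{i=1}^{n}\varepsilon_i|f_1(X_i)-f_2(X_i)|\Bigr)^+.$$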

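The proof of Theorem 4.12 in [21] contains the following statement. If $T$ is a bounded subset of $\mathbb R^n$, functions $\varphi_i$, $i=1,\dots,n$ are contractions such that $\varphi_i(0)=0$ and a function $G:\mathbb R\to\mathbb R$ is convex and nondecreasing then

$$E\,G\Bigl(\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_i\varphi_i(t_i)\Bigr)\le E\,G\Bigl(\sup_{t\in T}\sum_{i=1}^{n}\varepsilon_it_i\Bigr).$$

If we take $G(x)=x^+$, $\varphi_i(x)=|x|$ and $T=\{(f_1(X_i)-f_2(X_i))_{i=1}^{n}:\ f_1\in\mathcal F_1,\ f_2\in\mathcal F_2\}$ we get (first conditionally on $(X_i)_{i=1}^{n}$ and then taking expectations)

$$E\Bigl(\sup_{\mathcal F_1,\mathcal F_2}\sum_{i=1}^{n}\varepsilon_i|f_1(X_i)-f_2(X_i)|\Bigr)^+\le E\Bigl(\sup_{\mathcal F_1,\mathcal F_2}\sum_{i=1}^{n}\varepsilon_i\bigl(f_1(X_i)-f_2(X_i)\bigr)\Bigr)^+$$

$$\le E\Bigl(\sup_{\mathcal F_1}\sum_{i=1}^{n}\varepsilon_if_1(X_i)\Bigr)^++E\Bigl(\sup_{\mathcal F_2}\sum_{i=1}^{n}\varepsilon_if_2(X_i)\Bigr)^+,$$

where in the last inequality we used the fact that the sequence $(-\varepsilon_i)_{i=1}^{n}$ is equal in distribution to $(\varepsilon_i)_{i=1}^{n}$. Combining the bounds gives

$$E\Bigl(\sup_{\mathcal F}\sum_{i=1}^{n}\varepsilon_if(X_i)\Bigr)^+\le E\Bigl(\sup_{\mathcal F_1}\sum_{i=1}^{n}\varepsilon_if_1(X_i)\Bigr)^++E\Bigl(\sup_{\mathcal F_2}\sum_{i=1}^{n}\varepsilon_if_2(X_i)\Bigr)^+.$$

Now by induction we easily get (5.1). Finally, again using the fact that $(-\varepsilon_i)_{i=1}^{n}$ is equal in distribution to $(\varepsilon_i)_{i=1}^{n}$, we conclude the proof:

$$E\Bigl\|\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal H^{(l)}}\le E\Bigl(\sup_{h\in\mathcal H^{(l)}}\sum_{i=1}^{n}\varepsilon_ih(X_i)\Bigr)^++E\Bigl(\sup_{h\in\mathcal H^{(l)}}\sum_{i=1}^{n}(-\varepsilon_i)h(X_i)\Bigr)^+$$

$$=2E\Bigl(\sup_{h\in\mathcal H^{(l)}}\sum_{i=1}^{n}\varepsilon_ih(X_i)\Bigr)^+\le 2l\,E\Bigl(\sup_{h\in\mathcal H}\sum_{i=1}^{n}\varepsilon_ih(X_i)\Bigr)^+\le 2l\,E\Bigl\|\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal H}.$$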



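Proof of Theorem 11. We have the following bounds:

$$E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,Y_j)\Bigr|=E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_j\sum_{y\in\mathcal Y}m_f(X_j,y)I_{\{Y_j=y\}}\Bigr|\le\sum_{y\in\mathcal Y}E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)I_{\{Y_j=y\}}\Bigr|$$

$$\le\frac12\sum_{y\in\mathcal Y}E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\bigl(2I_{\{Y_j=y\}}-1\bigr)\Bigr|+\frac12\sum_{y\in\mathcal Y}E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\Bigr|.$$

Denote $\sigma_j(y):=2I_{\{Y_j=y\}}-1$. Given $\{(X_j,Y_j):1\le j\le n\}$, the random variables $\{\varepsilon_j\sigma_j(y):1\le j\le n\}$ are i.i.d. Rademacher. Hence, we have

$$E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\bigl(2I_{\{Y_j=y\}}-1\bigr)\Bigr|=EE_\varepsilon\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_j\sigma_j(y)m_f(X_j,y)\Bigr|$$

$$=EE_\varepsilon\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\Bigr|=E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\Bigr|.$$

Therefore, we have

$$E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,Y_j)\Bigr|\le\sum_{y\in\mathcal Y}E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\Bigr|.$$

Next, using Lemma 2, we get for all $y\in\mathcal Y$

$$E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,y)\Bigr|\le E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j,y)\Bigr|+E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_j\max_{y'\ne y}f(X_j,y')\Bigr|$$

$$\le E\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j)\Bigr|+E\sup_{f\in\mathcal F^{(M-1)}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j)\Bigr|\le(2M-1)E\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j)\Bigr|.$$

This implies

$$E\sup_{f\in\tilde{\mathcal F}}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jm_f(X_j,Y_j)\Bigr|\le\sum_{y\in\mathcal Y}(2M-1)E\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j)\Bigr|=M(2M-1)E\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{j=1}^{n}\varepsilon_jf(X_j)\Bigr|,$$

and the result follows from Theorem 2 (one can use in this theorem the continuous function $\varphi$ that is equal to 1 on $(-\infty,0]$, is equal to 0 on $[1,+\infty)$ and is linear in between).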

□ 

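In the rest of the paper, we assume that the set of labels is $\{-1,1\}$, so that $\tilde S:=S\times\{-1,1\}$ and $\tilde{\mathcal F}:=\{\tilde f:\ f\in\mathcal F\}$, where $\tilde f(x,y):=yf(x)$. $P$ will denote the distribution of $(X,Y)$, $P_n$ the empirical distribution based on the observations $((X_1,Y_1),\dots,(X_n,Y_n))$. Clearly, we have

$$R_n(\tilde{\mathcal F})=E\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_iY_if(X_i)\Bigr|=EE_\varepsilon\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{i=1}^{n}\tilde\varepsilon_if(X_i)\Bigr|,$$

where $\tilde\varepsilon_i:=Y_i\varepsilon_i$. Since, for given $\{(X_i,Y_i)\}$, $\{\tilde\varepsilon_i\}$ and $\{\varepsilon_i\}$ have the same distribution, we get

$$E_\varepsilon\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{i=1}^{n}\tilde\varepsilon_if(X_i)\Bigr|=E_\varepsilon\sup_{f\in\mathcal F}\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_if(X_i)\Bigr|,$$

which immediately implies $R_n(\tilde{\mathcal F})=R_n(\mathcal F)$.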

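The results of Section 2 now give some useful bounds for boosting and other methods of combining the classifiers. Namely, we get in this case the following theorem (compare with the recent result of Schapire, Freund, Bartlett and Lee (1998)).

Given a class $\mathcal H$ of measurable functions from $S$ into $\mathbb R$, we denote by $\mathrm{conv}(\mathcal H)$ the closed convex hull of $\mathcal H$, i.e. $\mathrm{conv}(\mathcal H)$ consists of all functions on $S$ that are pointwise limits of convex combinations of functions from $\mathcal H$:

$$\mathrm{conv}(\mathcal H):=\Bigl\{f:\ \forall x\in S\ f(x)=\lim_{N\to\infty}f_N(x),\ f_N=\sum_{j=1}^{N}w_j^Nh_j^N,\ w_j^N\ge 0,\ \sum_{j=1}^{N}w_j^N=1,\ h_j^N\in\mathcal H,\ N\ge 1\Bigr\}.$$

Let $\varphi$ be a function such that $\varphi(x)\ge I_{(-\infty,0]}(x)$ for all $x\in\mathbb R$ and $\varphi$ satisfies the Lipschitz condition with constant $L(\varphi)$.

Theorem 12. Let $\mathcal F:=\mathrm{conv}(\mathcal H)$, where $\mathcal H$ is a class of measurable functions from $(S,\mathcal A)$ into $\mathbb R$. For all $t>0$,

$$P\Bigl\{\exists f\in\mathcal F:\ P\{\tilde f\le 0\}>\inf_{\delta\in(0,1]}\Bigl[P_n\varphi\Bigl(\frac{\tilde f}{\delta}\Bigr)+\frac{8L(\varphi)}{\delta}R_n(\mathcal H)+\Bigl(\frac{\log\log_2(2\delta^{-1})}{n}\Bigr)^{1/2}\Bigr]+\frac{t}{\sqrt n}\Bigr\}\le 2\exp\{-2t^2\}.$$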

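Proof. Since $\mathcal F:=\mathrm{conv}(\mathcal H)$, where $\mathcal H$ is a class of measurable functions from $(S,\mathcal A)$ into $\mathbb R$, we have

$$R_n(\mathcal F)=E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal F}$$

$$=E\sup\Bigl\{\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_if_N(X_i)\Bigr|:\ f_N=\sum_{j=1}^{N}w_j^Nh_j^N,\ w_j^N\ge 0,\ \sum_{j=1}^{N}w_j^N=1,\ h_j^N\in\mathcal H,\ N\ge 1\Bigr\}$$

$$=E\Bigl\|n^{-1}\sum_{i=1}^{n}\varepsilon_i\delta_{X_i}\Bigr\|_{\mathcal H}=R_n(\mathcal H).$$

It follows that $R_n(\tilde{\mathcal F})=R_n(\mathcal H)$, and Theorem 2 implies the result.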
In the voting methods of combining the classifiers (such as boosting, bagging (Breiman (1996)), etc.), a classifier produced at each iteration is a convex combination $f_S\in\mathrm{conv}(\mathcal H)$ of simple base classifiers from the class $\mathcal H$ ($f_S$ depends on the training sample $S:=((X_1,Y_1),\dots,(X_n,Y_n))$). The bound of Theorem 12 implies that for a given $\alpha\in(0,1)$ with probability at least $1-\alpha$

$$P\{\tilde f_S\le 0\}\le\inf_{\delta\in(0,1]}\Bigl[P_n\{\tilde f_S\le\delta\}+\frac{8}{\delta}R_n(\mathcal H)+\Bigl(\frac{\log\log_2(2\delta^{-1})}{n}\Bigr)^{1/2}\Bigr]+\frac{t_\alpha}{\sqrt n},$$

where $t_\alpha:=\sqrt{\tfrac12\log\tfrac{2}{\alpha}}$. In particular, if $\mathcal H$ is a VC-class of classifiers $h:S\mapsto\{-1,1\}$ (which means that the class of sets $\{\{x:h(x)=+1\}:\ h\in\mathcal H\}$ is a Vapnik-Chervonenkis class) with VC-dimension $V(\mathcal H)$, we have with some constant $C>0$

$$R_n(\mathcal H)\le C\sqrt{\frac{V(\mathcal H)}{n}}.$$

This implies that with probability at least $1-\alpha$

$$P\{\tilde f_S\le 0\}\le\inf_{\delta\in(0,1]}\Bigl[P_n\{\tilde f_S\le\delta\}+\frac{C}{\delta}\sqrt{\frac{V(\mathcal H)}{n}}+\Bigl(\frac{\log\log_2(2\delta^{-1})}{n}\Bigr)^{1/2}\Bigr]+\sqrt{\frac{1}{2n}\log\frac{2}{\alpha}},$$

which slightly improves the main bound of the paper of Schapire, Freund, Bartlett and Lee (1998), which has a factor $\log(n/V(\mathcal H))$ in front of the term $C\delta^{-1}(V(\mathcal H)/n)^{1/2}$.
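To see how such a margin bound behaves on data, one can evaluate its right-hand side on a grid of values of $\delta$. The sketch below is only illustrative: the constant `C` is not specified in the text and is set to 1 here as an assumption, the infimum is replaced by a minimum over a finite grid (which can only make the value larger), and the function name is hypothetical.

```python
import math

def margin_bound(margins, vc_dim, alpha, C=1.0, grid=200):
    """Evaluate, over a grid of delta in (0, 1], the expression

        P_n{margin <= delta} + (C / delta) * sqrt(V / n)
            + sqrt(log(log2(2 / delta)) / n)

    and add the confidence term sqrt(log(2 / alpha) / (2 n)).
    C is an unspecified constant in the text; C = 1 is an assumption.
    """
    n = len(margins)
    best = float("inf")
    for i in range(1, grid + 1):
        delta = i / grid
        empirical = sum(m <= delta for m in margins) / n
        complexity = (C / delta) * math.sqrt(vc_dim / n)
        loglog = math.sqrt(math.log(math.log2(2.0 / delta)) / n)
        best = min(best, empirical + complexity + loglog)
    return best + math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
```

Here `margins` would hold the values $Y_jf_S(X_j)$ on the training sample; a classifier with most margins well above zero allows a large $\delta$ and hence a small complexity term.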

Example. In this example we consider a popular boosting algorithm called AdaBoost. At the beginning (at the first iteration) AdaBoost assigns uniform weights $w_j^{(1)}=n^{-1}$ to the labeled observations $(X_1,Y_1),\dots,(X_n,Y_n)$. At each iteration the algorithm updates the weights. Let $w^{(k)}=(w_1^{(k)},\dots,w_n^{(k)})$ denote the vector of weights at the $k$-th iteration. Let $P_{n,w^{(k)}}$ be the weighted empirical measure on the $k$-th iteration:

$$P_{n,w^{(k)}}:=\sum_{i=1}^{n}w_i^{(k)}\delta_{(X_i,Y_i)}.$$

AdaBoost calls iteratively a base learning algorithm (called "weak learner") that returns at the $k$-th iteration a classifier $h_k\in\mathcal H$ and computes the weighted training error of $h_k$:

$$e_k:=P_{n,w^{(k)}}\{y\ne h_k(x)\}.$$

(In fact, the weak learner attempts to find a classifier with small enough weighted training error, at least such that $e_k\le 1/2$.) Then the weights are updated according to the rule

$$w_j^{(k+1)}:=\frac{w_j^{(k)}\exp\{-Y_j\alpha_kh_k(X_j)\}}{Z_k},$$

where

$$Z_k:=\sum_{j=1}^{n}w_j^{(k)}\exp\{-Y_j\alpha_kh_k(X_j)\}$$

and

$$\alpha_k:=\frac12\log\frac{1-e_k}{e_k}.$$

After $N$ iterations AdaBoost outputs a classifier

$$f_S(x):=\frac{\sum_{k=1}^{N}\alpha_kh_k(x)}{\sum_{k=1}^{N}\alpha_k}.$$

The above bounds, of course, apply to this classifier since $f_S\in\mathrm{conv}(\mathcal H)$. Another way to use Theorem 12 in the case of this example is to choose a decreasing function $\varphi$, satisfying all the conditions of Theorem 12 with $L(\varphi)=1$ and such that $\varphi(u)\le e^{-u}$ for all $u\in\mathbb R$. It is easy to see that such a choice is possible. Let us also set

$$\delta:=\Bigl(\sum_{k=1}^{N}\alpha_k\Bigr)^{-1}\wedge 1.$$
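Then it is not hard to check that

$$\varphi\Bigl(\frac{yf_S(x)}{\delta}\Bigr)\le\varphi\Bigl(y\sum_{k=1}^{N}\alpha_kh_k(x)\Bigr)\le\exp\Bigl\{-y\sum_{k=1}^{N}\alpha_kh_k(x)\Bigr\}.$$

Therefore

$$P_n\varphi\Bigl(\frac{\tilde f_S}{\delta}\Bigr)\le P_n\exp\Bigl\{-y\sum_{k=1}^{N}\alpha_kh_k(x)\Bigr\}.$$

A simple (and well known in the literature on boosting, see e.g. Schapire, Freund, Bartlett and Lee (1998)) computation shows that

$$P_n\exp\Bigl\{-y\sum_{k=1}^{N}\alpha_kh_k(x)\Bigr\}=\prod_{k=1}^{N}2\sqrt{e_k(1-e_k)}.$$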
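The identity between the empirical exponential loss and the product of the normalizers $Z_k=2\sqrt{e_k(1-e_k)}$ can be verified on a toy run of AdaBoost. The sketch below is an illustration under assumptions not present in the text: the base class is a small finite pool of hypotheses given by their $\pm1$ predictions on the sample, the "weak learner" simply picks the pool member with the smallest weighted error, and the data are made up.

```python
import math

def adaboost(labels, pool, rounds):
    """Run AdaBoost over a finite pool of +/-1 prediction vectors.

    Returns the chosen hypotheses, coefficients alpha_k and weighted errors e_k.
    """
    n = len(labels)
    w = [1.0 / n] * n                          # uniform initial weights
    chosen, alphas, errors = [], [], []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wj for wj, hj, yj in zip(w, h, labels) if hj != yj)
        h = min(pool, key=weighted_error)      # toy "weak learner"
        e = weighted_error(h)
        if not 0.0 < e < 0.5:                  # no usable hypothesis left
            break
        alpha = 0.5 * math.log((1.0 - e) / e)
        unnormalized = [wj * math.exp(-yj * alpha * hj)
                        for wj, hj, yj in zip(w, h, labels)]
        z = sum(unnormalized)                  # normalizer Z_k
        w = [u / z for u in unnormalized]
        chosen.append(h)
        alphas.append(alpha)
        errors.append(e)
    return chosen, alphas, errors

# Made-up sample: six labels and three hypotheses, each wrong on one point.
labels = [1, 1, -1, -1, 1, -1]
pool = [
    [1, 1, -1, -1, -1, -1],
    [1, -1, -1, -1, 1, -1],
    [1, 1, 1, -1, 1, -1],
]
chosen, alphas, errors = adaboost(labels, pool, rounds=3)
n = len(labels)
votes = [sum(a * h[j] for a, h in zip(alphas, chosen)) for j in range(n)]
lhs = sum(math.exp(-labels[j] * votes[j]) for j in range(n)) / n   # P_n exp{-y sum alpha_k h_k}
rhs = math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in errors)    # prod 2 sqrt(e_k (1 - e_k))
f_S = [v / sum(alphas) for v in votes]                             # normalized combined classifier
```

The two quantities `lhs` and `rhs` agree up to rounding, because the weights after the last round sum to one and each $Z_k$ equals $2\sqrt{e_k(1-e_k)}$ for the above choice of $\alpha_k$.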

We also have

$$\sum_{k=1}^{N}\alpha_k=\log\prod_{k=1}^{N}\sqrt{\frac{1-e_k}{e_k}}.$$

It follows now from the bound of Theorem 12 that with probability at least $1-\alpha$

$$P\{\tilde f_S\le 0\}\le\prod_{k=1}^{N}2\sqrt{e_k(1-e_k)}+8\Bigl(\log\prod_{k=1}^{N}\sqrt{\frac{1-e_k}{e_k}}\vee 1\Bigr)R_n(\mathcal H)+\Biggl(\frac{\log\log_2\Bigl(2\Bigl(\log\prod_{k=1}^{N}\sqrt{\frac{1-e_k}{e_k}}\vee 1\Bigr)\Bigr)}{n}\Biggr)^{1/2}+\sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}.$$

The results of Section 3 provide some improvements of the above bounds on the generalization error of convex combinations of base classifiers. To be specific, consider the case when $\mathcal H$ is a VC-class of classifiers. Let $V:=V(\mathcal H)$ be its VC-dimension. A well known bound on the entropy of the convex hull of a VC-class (see van der Vaart and Wellner (1996), p. 142) implies that

$$H_{d_{P_n,2}}(\mathrm{conv}(\mathcal H);u)\le\sup_{Q\in\mathcal P(S)}H_{d_{Q,2}}(\mathrm{conv}(\mathcal H);u)\le Du^{-\frac{2(V-1)}{V}}.$$

[The bound on the entropy of a convex hull goes back to Dudley; the precise value of the exponent was given by Ball and Pajor, van der Vaart and Wellner, Carl; in the case of the convex hull of a VC-class, the above bound relies also on Haussler's improvement of Dudley's original bound on the entropy of a VC-class. See the discussion in the books of van der Vaart and Wellner (1996) and Dudley (1999) and references therein.] It immediately follows from Theorem 5 that for all $\gamma\ge\frac{2(V-1)}{2V-1}$ and for some constants $C,B$

$$P\Bigl\{\exists f\in\mathrm{conv}(\mathcal H):\ P\{\tilde f\le 0\}>\frac{C}{n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma}}\Bigr\}\le B\log_2\log_2 n\,\exp\Bigl\{-\frac12n^{\gamma/2}\Bigr\},$$

where

$$\hat\delta_n(\gamma;f):=\sup\bigl\{\delta\in(0,1):\ \delta^{\gamma}P_n\{(x,y):\ yf(x)\le\delta\}\le n^{-1+\frac{\gamma}{2}}\bigr\}.$$
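The quantity $\hat\delta_n(\gamma;f)$ depends only on the observed margins $Y_jf(X_j)$ and can be approximated by bisection, since $\delta\mapsto\delta^{\gamma}P_n\{yf(x)\le\delta\}$ is nondecreasing. The following is a minimal sketch; the function name and the tolerance are assumptions made for illustration.

```python
def delta_hat(margins, gamma, tol=1e-10):
    """Approximate sup{delta in (0,1): delta**gamma * P_n{margin <= delta} <= n**(-1 + gamma/2)}."""
    n = len(margins)
    threshold = n ** (-1.0 + gamma / 2.0)

    def g(delta):
        # Nondecreasing in delta: product of delta**gamma and an empirical CDF.
        return delta ** gamma * sum(m <= delta for m in margins) / n

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= threshold:
            lo = mid          # mid is admissible, the supremum is at least mid
        else:
            hi = mid
    return lo
```

For instance, if every margin is negative then $P_n\{yf(x)\le\delta\}=1$ for all $\delta$ and the result is $n^{(-1+\gamma/2)/\gamma}$, while a sample with all margins above 1 gives a value close to 1.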

This shows that in the case when the VC-dimension of the base is relatively small the generalization error 
of boosting and some other convex combinations of simple classifiers obtained by various versions of voting 
methods becomes better than it was suggested by the bounds of Schapire, Freund, Bartlett and Lee (1998). 
One can also conjecture, based on the bounds of Section 3, that outstanding generalization ability of these 
methods observed in numerous experiments can be related not only to the fact that they produce large 
margin classifiers, but also to the fact that the combined classifier belongs to a subset of the whole convex 
hull for which the random entropy $H_{d_{P_n,2}}$ is much smaller than for the whole convex hull.
Finally, it is worth mentioning that the bounds in terms of the so called margin cost functions (see e.g. Mason, Bartlett and Baxter (1999), Mason, Baxter, Bartlett and Frean (1999)) easily follow from Theorem 1. Namely, Theorem 1 implies that with probability at least $1-\alpha$

N>il V n V n / J V 2n a 

where $\{\varphi_N\}$ is any sequence of Lipschitz cost functions such that $\varphi_N(x)\ge I_{(-\infty,0]}(x)$ for all $x\in\mathbb R$, $N\ge 1$ and $L_N$ is a Lipschitz constant of $\varphi_N$.



6. Bounding the generalization error in neural network learning. We turn now to the applications of the bounds of the previous section in neural network learning. We start with the description of the class of feedforward neural networks for which the bounds on the generalization error will be proved. Let $\mathcal H$ be a class of measurable functions from $(S,\mathcal A)$ into $\mathbb R$ (base functions). Consider an acyclic directed graph $G$. Suppose that $G$ has a unique vertex $v_i$ (input) that has no incoming edges and a unique vertex $v_o$ (output) that has one outgoing edge. The vertices (nodes) of the graph will be called neurons. Suppose the set $V$ of all the neurons is divided into layers

$$V=\{v_i\}\cup\bigcup_{j=0}^{l}V_j,$$

where $l\ge 0$ and $V_l=\{v_o\}$. The neurons $v_i$, $v_o$ are called the input and the output neurons, respectively. The neurons of the layer $V_0$ will be called the base neurons. Suppose also that the inputs of the base neurons are the outputs of the input neuron. Suppose also that the inputs of the neurons of the layer $V_j$, $j\ge 1$ are the outputs of the neurons from the set $\bigcup_{k=0}^{j-1}V_k$. To define the network, we will assign labels to the neurons in the following way. Each of the base neurons is labeled by a function from the base class $\mathcal H$. Each neuron of the $j$th layer $V_j$, where $j\ge 1$, is labeled by a vector $w:=(w_1,\dots,w_n)\in\mathbb R^n$, where $n$ is the number of inputs of the neuron. $w$ will be called the vector of weights of the neuron.

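Given a Borel function $\sigma$ from $\mathbb R$ into $[-1,1]$ (a sigmoid) and a vector $w:=(w_1,\dots,w_n)\in\mathbb R^n$, let

$$N_{\sigma,w}:\mathbb R^n\to\mathbb R,\qquad N_{\sigma,w}(u_1,\dots,u_n):=\sigma\Bigl(\sum_{j=1}^{n}w_ju_j\Bigr).$$

For $w\in\mathbb R^n$,

$$\|w\|_{\ell_1}:=\sum_{i=1}^{n}|w_i|.$$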
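A single neuron $N_{\sigma,w}$ and the $\ell_1$-norm of its weight vector can be written down directly. In the sketch below the sigmoid is taken to be `math.tanh`, which maps into $(-1,1)$ and is Lipschitz with constant 1; this choice, like the function names, is an assumption made only for illustration.

```python
import math

def l1_norm(w):
    """The l1-norm of the weight vector of a neuron."""
    return sum(abs(w_i) for w_i in w)

def neuron(sigma, w, u):
    """N_{sigma, w}(u_1, ..., u_n) = sigma(sum_j w_j * u_j)."""
    return sigma(sum(w_i * u_i for w_i, u_i in zip(w, u)))

# With a 1-Lipschitz sigmoid, the output changes by at most
# l1_norm(w) * max_j |u_j - v_j| when the inputs change from u to v.
w = [0.5, -1.5]
u = [0.2, -0.4]
v = [0.1, -0.1]
out_u = neuron(math.tanh, w, u)
out_v = neuron(math.tanh, w, v)
```

This sensitivity to the inputs, controlled by the Lipschitz constant of the sigmoid times $\|w\|_{\ell_1}$, is the reason the bounds of this section are stated in terms of the $\ell_1$-norms of the weights.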

Let $\sigma_j$, $j\ge 1$ be functions from $\mathbb R$ into $[-1,1]$, satisfying the Lipschitz conditions:

$$|\sigma_j(u)-\sigma_j(v)|\le L_j|u-v|,\quad u,v\in\mathbb R.$$

The network works in the following way. The input neuron inputs an instance $x\in S$. A base neuron computes the value of the base function (it is labeled with) on this instance and outputs the value through its output edges. A neuron in the $j$th layer ($j\ge 1$) computes and outputs through its output edges the value $N_{\sigma_j,w}(u_1,\dots,u_n)$ (where $u_1,\dots,u_n$ are the values of the inputs of the neuron). The network outputs the value $f(x)$ (of a function $f$ it computes) through the output edge.

We denote by $\mathcal N_l$ the set of all such networks. We call $\mathcal N_l$ the class of feedforward neural networks with base $\mathcal H$ and $l$ layers of neurons (and with sigmoids $\{\sigma_j\}$). Let $\mathcal N_\infty:=\bigcup_{j=0}^{\infty}\mathcal N_j$. Define $\mathcal H_0:=\mathcal H$, and then recursively

$$\mathcal H_j:=\bigl\{N_{\sigma_j,w}(h_1,\dots,h_n):\ n\ge 0,\ h_i\in\mathcal H_{j-1},\ w\in\mathbb R^n\bigr\}\cup\mathcal H_{j-1}.$$

Denote $\mathcal H_\infty:=\bigcup_{j=0}^{\infty}\mathcal H_j$. Clearly, $\mathcal H_\infty$ includes all the functions computable by feedforward neural networks with base $\mathcal H$.
with base H. 

Let $\{A_j\}$ be a sequence of positive numbers. We also define recursively classes of functions computable 
by feedforward neural networks with restrictions on the weights of neurons: 

$$\mathcal{H}_j(A_1, \dots, A_j) := \{N_{\sigma_j,w}(h_1, \dots, h_n) : n \ge 0,\ h_i \in \mathcal{H}_{j-1}(A_1, \dots, A_{j-1}),\ w \in \mathbb{R}^n,\ \|w\|_{\ell_1} \le A_j\} \cup \mathcal{H}_{j-1}(A_1, \dots, A_{j-1}).$$

Clearly, 

$$\mathcal{H}_j = \bigcup \{\mathcal{H}_j(A_1, \dots, A_j) : A_1, \dots, A_j < +\infty\}.$$

As in the previous section, let $\varphi$ be a function such that $\varphi(x) \ge I_{(-\infty,0]}(x)$ for all $x \in \mathbb{R}$ and $\varphi$ satisfies the 
Lipschitz condition with constant $L(\varphi)$. 
We start with the following result. 

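Theorem 13. For all $t > 0$ and for all $l \ge 1$ 

$$\mathbb{P}\Big\{\exists f \in \mathcal{H}_l(A_1, \dots, A_l) : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta} \prod_{j=1}^{l} (2L_jA_j + 1)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$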
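Proof. We apply Theorem 2 to the class $\mathcal{F} = \mathcal{H}_l(A_1, \dots, A_l) =: \mathcal{H}_l'$, which gives for all $t > 0$ 

$$\mathbb{P}\Big\{\exists f \in \mathcal{H}_l' : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, G_n(\mathcal{H}_l') + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

Thus, it's enough to show that 

$$E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_l'} \le \prod_{j=1}^{l} (2L_jA_j + 1)\, E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}}.$$

To this end, note that 

$$E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_l'} \le E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{G}_l} + E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_{l-1}'}, \qquad (6.1)$$

where 

$$\mathcal{G}_l := \{N_{\sigma_l,w}(h_1, \dots, h_n) : n \ge 0,\ h_i \in \mathcal{H}_{l-1}(A_1, \dots, A_{l-1}),\ w \in \mathbb{R}^n,\ \|w\|_{\ell_1} \le A_l\}.$$

Consider two Gaussian processes 

$$Z_1(f) := n^{-1/2} \sum_{i=1}^{n} g_i\, (\sigma_l \circ f)(X_i)$$

and 

$$Z_2(f) := L_l\, n^{-1/2} \sum_{i=1}^{n} g_i f(X_i),$$

where 

$$f \in \Big\{\sum_{i=1}^{n} w_i h_i : n \ge 0,\ h_i \in \mathcal{H}_{l-1}',\ w \in \mathbb{R}^n,\ \|w\|_{\ell_1} \le A_l\Big\} =: \mathcal{G}_l'.$$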
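
We have 

$$E_g|Z_1(f) - Z_1(h)|^2 = n^{-1}\sum_{i=1}^{n} |\sigma_l(f(X_i)) - \sigma_l(h(X_i))|^2 \le L_l^2\, n^{-1}\sum_{i=1}^{n} |f(X_i) - h(X_i)|^2 = E_g|Z_2(f) - Z_2(h)|^2.$$

By Slepian's Lemma (see Ledoux and Talagrand (1991)), we get 

$$E_g\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{G}_l} = n^{-1/2} E_g\|Z_1\|_{\mathcal{G}_l'} \le 2n^{-1/2} E_g\|Z_2\|_{\mathcal{G}_l'} = 2L_l\, E_g\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{G}_l'}. \qquad (6.2)$$

Since $\mathcal{G}_l' = A_l\, \mathrm{conv}_s(\mathcal{H}_{l-1}')$ [here $\mathrm{conv}_s(\mathcal{G})$ denotes the closed symmetric convex hull of a class $\mathcal{G}$, i.e. the closed convex 
hull of the class $\mathcal{G} \cup -\mathcal{G}$], it is easy to get that 

$$E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{G}_l'} = A_l\, E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_{l-1}'}. \qquad (6.3)$$

It follows from the bounds (6.1)-(6.3) that 

$$E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_l'} \le (2L_lA_l + 1)\, E\Big\|n^{-1}\sum_{i=1}^{n} g_i \delta_{X_i}\Big\|_{\mathcal{H}_{l-1}'}.$$

The result now follows by induction. 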

□ 

Remark. It can be shown that in the case of multilayer perceptrons (in which the neurons in each 
layer are linked only to the neurons in the previous layer) the factor $\prod_{j=1}^{l}(2L_jA_j + 1)$ in the bound of the 
theorem can be replaced by $\prod_{j=1}^{l}(2L_jA_j)$. If the sigmoids are odd functions, the same factor in the case 
of general feedforward architecture of the network becomes $\prod_{j=1}^{l}(L_jA_j + 1)$, and in the case of multilayer 
perceptrons $\prod_{j=1}^{l} L_jA_j$. Bartlett (1998) obtained a bound similar to the first inequality of Theorem 13 for 
a more special class $\mathcal{H}$ and with larger constants. In the case when $A_j \equiv A$, $L_j \equiv L$ (the case considered 
by Bartlett) the expression in the right hand side of his bound includes $\frac{(AL)^{l(l+1)/2}}{\delta^{l}}$, which is replaced in our 
bound by $\frac{(AL)^{l}}{\delta}$. This improvement can be substantial in applications, since the above quantities play the 
role of complexity penalties. 
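
To make the four variants of the complexity factor in the Remark concrete, here is a minimal Python sketch. The function name `complexity_factor` and its two flags are hypothetical, introduced only for illustration. The code simply evaluates the products stated in the Remark for given Lipschitz constants $L_j$ and weight bounds $A_j$.

```python
from math import prod
from typing import Sequence


def complexity_factor(L: Sequence[float],
                      A: Sequence[float],
                      multilayer_perceptron: bool = False,
                      odd_sigmoids: bool = False) -> float:
    """Product over layers of (c * L_j * A_j + s), following the Remark.

    c = 2 in general and c = 1 when the sigmoids are odd functions;
    s = 1 for a general feedforward architecture and s = 0 for
    multilayer perceptrons.
    """
    if len(L) != len(A):
        raise ValueError("L and A must have one entry per layer")
    c = 1.0 if odd_sigmoids else 2.0
    s = 0.0 if multilayer_perceptron else 1.0
    return prod(c * Lj * Aj + s for Lj, Aj in zip(L, A))
```

For two layers with $L_j = 1$ and $A_j = 2$, the four variants evaluate to 25, 16, 9 and 4, which shows how much the architecture and the symmetry of the sigmoids can tighten the penalty.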

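Given a neural network $f \in \mathcal{N}_\infty$, let 

$$\ell(f) := \min\{j \ge 1 : f \in \mathcal{N}_j\}.$$

Let $\{b_k\}$ be a sequence of nonnegative numbers. For a number $k$, $1 \le k \le \ell(f)$, let $V_k(f)$ denote the set of 
all neurons of layer $k$ in the graph representing $f$. Denote 

$$W_k(f) := \max_{N \in V_k(f)} \|w^{(N)}\|_{\ell_1} \vee b_k, \qquad k = 1, 2, \dots, \ell(f),$$

where $w^{(N)}$ is the vector of weights of the neuron $N$, and let 

$$\Lambda(f) := \prod_{k=1}^{\ell(f)} (4L_kW_k(f) + 1),$$

$$\Gamma_\alpha(f) := \sum_{k=1}^{\ell(f)} \sqrt{\frac{\alpha}{2}\log(2 + |\log_2 W_k(f)|)},$$

where $\alpha > 0$ is a number such that $\zeta(\alpha) < 3/2$, $\zeta$ being the Riemann zeta-function: 

$$\zeta(\alpha) := \sum_{k=1}^{\infty} k^{-\alpha}.$$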
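
As an illustration of these definitions, the following Python sketch computes $W_k(f)$, $\Lambda(f)$ and $\Gamma_\alpha(f)$ from the weight vectors of each layer. The helper names (`layer_weight`, `big_lambda`, `gamma_alpha`) and the nested-list representation of a network's weights are assumptions made here for the example. The code also assumes every $W_k(f)$ is strictly positive so that $\log_2 W_k(f)$ is defined.

```python
import math
from typing import Sequence


def layer_weight(weight_vectors: Sequence[Sequence[float]],
                 b_k: float = 0.0) -> float:
    """W_k(f): largest l1-norm among the neurons of one layer, floored at b_k."""
    largest = max(sum(abs(x) for x in w) for w in weight_vectors)
    return max(largest, b_k)


def big_lambda(W: Sequence[float], L: Sequence[float]) -> float:
    """Lambda(f) = prod_k (4 * L_k * W_k(f) + 1)."""
    return math.prod(4.0 * L_k * W_k + 1.0 for L_k, W_k in zip(L, W))


def gamma_alpha(W: Sequence[float], alpha: float) -> float:
    """Gamma_alpha(f) = sum_k sqrt((alpha / 2) * log(2 + |log2 W_k(f)|)).

    Assumes every W_k is strictly positive.
    """
    return sum(
        math.sqrt(0.5 * alpha * math.log(2.0 + abs(math.log2(W_k))))
        for W_k in W
    )


# Hypothetical two-layer network: two neurons in layer 1, one in layer 2.
layers = [[[0.5, -0.5], [1.0, 1.0]], [[0.25, 0.25]]]
W = [layer_weight(vectors) for vectors in layers]   # [2.0, 0.5]
```

With unit Lipschitz constants, `big_lambda(W, [1.0, 1.0])` is $(4 \cdot 2 + 1)(4 \cdot 0.5 + 1) = 27$. Note that $\Gamma_\alpha$ depends on the weights only through $|\log_2 W_k(f)|$, so it grows very slowly compared with $\Lambda$.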
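
Theorem 14. For all $t > 0$ and for all $\alpha > 0$ such that $\zeta(\alpha) < 3/2$, the following bound holds: 

$$\mathbb{P}\Big\{\exists f \in \mathcal{H}_\infty : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\Gamma_\alpha(f) + t + 2}{\sqrt{n}}\Big\} \le 2(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\}.$$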



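Proof. With a slight abuse of notation, we write $f$ for both the neural network and the function it 
computes. Denote 

$$\Delta_k := \begin{cases} [2^{k-1}, 2^{k}) & \text{for } k \in \mathbb{Z},\ k \ne 0, 1, \\ [1/2, 2) & \text{for } k = 1. \end{cases}$$

The conditions $\ell(f) = l$ and 

$$W_j(f) \in \Delta_{k_j}, \qquad k_j \in \mathbb{Z} \setminus \{0\}, \qquad j = 1, \dots, l$$

easily imply that 

$$\Lambda(f) \ge \prod_{j=1}^{l} (2L_j 2^{k_j} + 1), \qquad \Gamma_\alpha(f) \ge \sum_{j=1}^{l} \sqrt{\frac{\alpha}{2}\log(|k_j| + 1)}$$

and also that $f \in \mathcal{H}_l(2^{k_1}, \dots, 2^{k_l})$. Therefore, the following bounds hold: 

$$\begin{aligned}
&\mathbb{P}\Big\{\exists f \in \mathcal{H}_\infty : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\Gamma_\alpha(f) + t + 2}{\sqrt{n}}\Big\} \\
&\quad \le \sum_{l=0}^{\infty} \sum_{k_1 \in \mathbb{Z}\setminus\{0\}} \cdots \sum_{k_l \in \mathbb{Z}\setminus\{0\}} \mathbb{P}\Big\{\exists f \in \mathcal{H}_\infty \cap \{f : \ell(f) = l,\ W_j(f) \in \Delta_{k_j},\ j = 1, \dots, l\} : \\
&\qquad\qquad P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\Gamma_\alpha(f) + t + 2}{\sqrt{n}}\Big\} \\
&\quad \le \sum_{l=0}^{\infty} \sum_{k_1 \in \mathbb{Z}\setminus\{0\}} \cdots \sum_{k_l \in \mathbb{Z}\setminus\{0\}} \mathbb{P}\Big\{\exists f \in \mathcal{H}_l(2^{k_1}, \dots, 2^{k_l}) : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) \\
&\qquad\qquad + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta} \prod_{j=1}^{l} (2L_j 2^{k_j} + 1)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\sum_{j=1}^{l} \sqrt{\frac{\alpha}{2}\log(|k_j| + 1)} + t + 2}{\sqrt{n}}\Big\}.
\end{aligned}$$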



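Using the bound of Theorem 13, we obtain 

$$\begin{aligned}
&\mathbb{P}\Big\{\exists f \in \mathcal{H}_\infty : P\{f \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\Gamma_\alpha(f) + t + 2}{\sqrt{n}}\Big\} \\
&\quad \le \sum_{l=0}^{\infty} \sum_{k_1 \in \mathbb{Z}\setminus\{0\}} \cdots \sum_{k_l \in \mathbb{Z}\setminus\{0\}} 2\exp\Big\{-2\Big(\sum_{j=1}^{l} \sqrt{\frac{\alpha}{2}\log(|k_j| + 1)} + t\Big)^2\Big\} \\
&\quad \le \sum_{l=0}^{\infty} \sum_{k_1 \in \mathbb{Z}\setminus\{0\}} \cdots \sum_{k_l \in \mathbb{Z}\setminus\{0\}} 2\exp\Big\{-\sum_{j=1}^{l} \alpha\log(|k_j| + 1) - 2t^2\Big\} \\
&\quad = 2\sum_{l=0}^{\infty} \sum_{k_1 \in \mathbb{Z}\setminus\{0\}} \cdots \sum_{k_l \in \mathbb{Z}\setminus\{0\}} \prod_{j=1}^{l} (|k_j| + 1)^{-\alpha} \exp\{-2t^2\} \\
&\quad = 2\sum_{l=0}^{\infty} \prod_{j=1}^{l} \Big(2\sum_{k=2}^{\infty} k^{-\alpha}\Big) \exp\{-2t^2\} = 2\sum_{l=0}^{\infty} [2(\zeta(\alpha) - 1)]^{l} \exp\{-2t^2\} \\
&\quad = 2(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\},
\end{aligned}$$

which yields the bound of the theorem. 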
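
The constant $2(3 - 2\zeta(\alpha))^{-1}$ obtained at the end of this computation is easy to evaluate numerically. The sketch below is an illustration only: the helper names `zeta_upper` and `probability_constant` are hypothetical, and $\zeta(\alpha)$ is bounded from above by a partial sum plus the integral tail bound $\sum_{k > K} k^{-\alpha} \le K^{1-\alpha}/(\alpha - 1)$, valid for $\alpha > 1$.

```python
def zeta_upper(alpha: float, K: int = 10_000) -> float:
    """Upper bound on zeta(alpha), alpha > 1: partial sum plus integral tail."""
    if alpha <= 1.0:
        raise ValueError("the series converges only for alpha > 1")
    partial = sum(k ** -alpha for k in range(1, K + 1))
    return partial + K ** (1.0 - alpha) / (alpha - 1.0)


def probability_constant(alpha: float) -> float:
    """Evaluate 2 / (3 - 2 * zeta(alpha)), requiring zeta(alpha) < 3/2."""
    z = zeta_upper(alpha)
    if z >= 1.5:
        raise ValueError("need zeta(alpha) < 3/2 for the geometric series to converge")
    return 2.0 / (3.0 - 2.0 * z)
```

For example, $\alpha = 3$ is admissible because $\zeta(3) \approx 1.2021 < 3/2$, whereas $\alpha = 2$ is not, since $\zeta(2) = \pi^2/6 \approx 1.645$. The condition $\zeta(\alpha) < 3/2$ is exactly what makes the ratio $2(\zeta(\alpha) - 1)$ of the geometric series smaller than 1.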



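It follows, in particular, that for any classifier $f_S \in \mathcal{H}_\infty$, based on the training data $S := 
((X_1, Y_1), \dots, (X_n, Y_n))$, we have 

$$\mathbb{P}\Big\{P\{f_S \le 0\} > \inf_{\delta \in (0,1]} \Big[P_n \varphi\Big(\frac{f_S}{\delta}\Big) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\, \Lambda(f_S)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{\Gamma_\alpha(f_S) + t + 2}{\sqrt{n}}\Big\} \le 2(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\}.$$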

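Next we consider a method of complexity penalization in neural network learning based on the penalties 
that depend on $\ell_1$-norms of the vectors of weights of the neurons. Suppose that $\hat f_S$ is the neural network 
from $\mathcal{F} \subset \mathcal{H}_\infty$ that minimizes the penalized training error 

$$\begin{aligned}
\hat f_S &:= \mathrm{argmin}_{f \in \mathcal{F}} \inf_{\delta \in (0,1]} \Big[P_n(\{f \le \delta\}) + \frac{2\sqrt{2\pi}}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2} + \frac{\Gamma_\alpha(f)}{\sqrt{n}}\Big] \\
&= \mathrm{argmin}_{f \in \mathcal{F}} \Big[P_n(\{f \le 0\}) + \inf_{\delta \in (0,1]} \hat\pi_n(f; \delta)\Big],
\end{aligned}$$

where the quantity $\inf_{\delta \in (0,1]} \hat\pi_n(f; \delta)$ plays the role of the complexity penalty, 

$$\hat\pi_n(f; \delta) := P_n(\{0 < f \le \delta\}) + \Psi_n(f; \delta),$$

$$\Psi_n(f; \delta) := \frac{2\sqrt{2\pi}}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2} + \frac{\Gamma_\alpha(f)}{\sqrt{n}}.$$

We define a distribution dependent version of this data dependent penalty as $\inf_{\delta \in (0,1]} \pi_n(f; \delta)$, where 

$$\pi_n(f; \delta) := P(\{0 < f \le 2\delta\}) + 2\Psi_n(f; \delta).$$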
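
The penalized training error can be approximated numerically by replacing the infimum over $\delta \in (0,1]$ with a minimum over a finite grid. The following Python sketch does this under explicit assumptions: `margins` is a hypothetical list of the values of $f$ at the $n$ training points, and `lam`, `gn`, `gamma` stand for $\Lambda(f)$, $G_n(\mathcal{H})$ and $\Gamma_\alpha(f)$, which are taken as given numbers rather than computed. The function names and the grid are illustrative choices, not part of the paper.

```python
import math
from typing import Sequence


def psi_n(delta: float, lam: float, gn: float, gamma: float, n: int) -> float:
    """Psi_n(f; delta) for delta in (0, 1], with Lambda(f), G_n(H), Gamma_alpha(f) given."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    # log2(2 / delta) >= 1 on (0, 1], so the inner logarithm is nonnegative;
    # max(...) only guards against rounding.
    loglog = max(math.log(math.log2(2.0 / delta)), 0.0)
    return (2.0 * math.sqrt(2.0 * math.pi) / delta * lam * gn
            + math.sqrt(loglog / n)
            + gamma / math.sqrt(n))


def penalized_error(margins: Sequence[float], lam: float, gn: float,
                    gamma: float, grid: int = 100) -> float:
    """Grid approximation of inf over delta of P_n({f <= delta}) + Psi_n(f; delta)."""
    n = len(margins)
    best = math.inf
    for j in range(1, grid + 1):
        delta = j / grid
        empirical = sum(1 for m in margins if m <= delta) / n
        best = min(best, empirical + psi_n(delta, lam, gn, gamma, n))
    return best
```

Minimizing `penalized_error` over a finite set of candidate networks would give a grid-based stand-in for $\hat f_S$. A finer grid gives a value closer to the true infimum, at proportionally higher cost.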

The first inequality of the next theorem provides an upper confidence bound on the generalization error of 
the classifier $\hat f_S$. The second bound is an "oracle inequality" that shows that the estimate $\hat f_S$ obtained by the 
above method possesses some optimality property (see Johnstone (1998), Barron, Birge and Massart (1999) 
for a general approach to penalization and oracle inequalities in nonparametric statistics). 



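Theorem 15. For all $t > 0$ and for all $\alpha > 0$ with $\zeta(\alpha) < 3/2$, the following bounds hold: 

$$\mathbb{P}\Big\{P\{\hat f_S \le 0\} > \inf_{f \in \mathcal{F}} \Big[P_n\{f \le 0\} + \inf_{\delta \in (0,1]} \hat\pi_n(f; \delta)\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\}$$

and 

$$\mathbb{P}\Big\{P\{\hat f_S \le 0\} - \inf_{g \in \mathcal{F}} P\{g \le 0\} > \inf_{f \in \mathcal{F}} \Big[P\{f \le 0\} - \inf_{g \in \mathcal{F}} P\{g \le 0\} + \inf_{\delta \in (0,1]} \pi_n(f; \delta)\Big] + \frac{2t+4}{\sqrt{n}}\Big\} \le 4(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\}.$$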



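Proof. The first bound follows from Theorem 14 and the definition of the estimate $\hat f_S$. To prove 
the second bound, we repeat the proof of Theorems 1, 2 to show that for any class $\mathcal{F}'$ 

$$\mathbb{P}\Big\{\exists f \in \mathcal{F}'\ \exists \delta \in (0,1] : P_n\{f \le \delta\} > \Big[P\{f \le 2\delta\} + \frac{2\sqrt{2\pi}}{\delta}\, G_n(\mathcal{F}') + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

The argument that led to Theorems 13 and 14 shows that 

$$\mathbb{P}\Big\{\exists f \in \mathcal{F}\ \exists \delta \in (0,1] : P_n\{f \le \delta\} > \Big[P\{f \le 2\delta\} + \frac{2\sqrt{2\pi}}{\delta}\, \Lambda(f)\, G_n(\mathcal{H}) + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2} + \frac{\Gamma_\alpha(f)}{\sqrt{n}}\Big] + \frac{t+2}{\sqrt{n}}\Big\} \le 2(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\}.$$

If now 

$$\inf_{f \in \mathcal{F}} \inf_{\delta \in (0,1]} \Big[P_n(\{f \le \delta\}) + \Psi_n(f; \delta)\Big] + \frac{t+2}{\sqrt{n}} > \inf_{f \in \mathcal{F}} \inf_{\delta \in (0,1]} \Big[P\{f \le 2\delta\} + 2\Psi_n(f; \delta)\Big] + \frac{2t+4}{\sqrt{n}},$$

then 

$$\exists f \in \mathcal{F}\ \exists \delta \in (0,1] : P_n\{f \le \delta\} > \Big[P\{f \le 2\delta\} + \Psi_n(f; \delta)\Big] + \frac{t+2}{\sqrt{n}}.$$

Combining this with the first bound gives 

$$\mathbb{P}\Big\{P\{\hat f_S \le 0\} > \inf_{f \in \mathcal{F}} \inf_{\delta \in (0,1]} \Big[P\{f \le 2\delta\} + 2\Psi_n(f; \delta)\Big] + \frac{2t+4}{\sqrt{n}}\Big\} \le 4(3 - 2\zeta(\alpha))^{-1}\exp\{-2t^2\},$$

which implies the result. 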